This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. We introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.
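To make "lookahead search inside the training loop" concrete, here is a minimal Python sketch of one training iteration as I read the paper's description. The Game, MCTS, and network interfaces (make_game, mcts.search, mcts.sample_move, network.fit) are placeholders I am assuming for illustration, not AlphaGo Zero's actual code.

```python
from typing import List, Tuple


def self_play_iteration(network, mcts, make_game, num_games: int) -> list:
    """One iteration of the loop: search-guided self-play, then a supervised fit.

    `network`, `mcts`, and `make_game` are assumed interfaces used only to show
    where the lookahead search sits relative to the training step.
    """
    examples: List[Tuple[object, list, float]] = []  # (state, search_probs, outcome)
    for _ in range(num_games):
        game = make_game()
        history = []
        while not game.is_over():
            # The search is guided by the *current* network; its visit-count
            # distribution is a stronger policy than the raw network output.
            search_probs = mcts.search(game, network)
            history.append((game.state(), search_probs))
            game.play(mcts.sample_move(search_probs))
        # Simplified: a full implementation records the outcome from the
        # perspective of the player to move at each recorded position.
        outcome = game.result()
        examples.extend((s, p, outcome) for s, p in history)
    # Supervised step: the network is trained toward the search probabilities
    # and the self-play outcomes, closing the loop for the next iteration.
    network.fit(examples)
    return examples
```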
Improvement: MCTS is a lookahead search algorithm; the more accurate the information it is given, the more efficient its search and the more accurate its results. Even starting from a random policy, MCTS can still extract useful information, because being able to look ahead is always an advantage; the search is simply less efficient with a random policy.

Data: MCTS can therefore always produce better self-play data than the raw policy it starts from.

Positive feedback: Trained by supervised learning on this better MCTS self-play data, the neural network improves itself, and in the next iteration it hands MCTS a better policy. Head joins tail, forming a positive-feedback loop, and the cycle keeps improving with every iteration.

The following excerpt from the paper explains how MCTS maps onto policy improvement and policy evaluation; note that the search policy is described as "much stronger":

The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy's recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self-play game outcome respectively.
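The two "projection" steps in the excerpt amount to a supervised loss: fit the network's policy head to the MCTS visit-count probabilities (cross-entropy) and its value head to the self-play outcome (squared error), with weight regularization on top. Below is a minimal PyTorch-style sketch; the combined loss form follows the paper's description, while the toy network, feature size, and dummy batch are purely illustrative assumptions.

```python
import torch
import torch.nn as nn


class TinyPolicyValueNet(nn.Module):
    """Toy stand-in for the policy/value network (not the real architecture)."""

    def __init__(self, n_features: int, n_moves: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_moves)  # move logits p
        self.value_head = nn.Linear(64, 1)         # scalar value v in (-1, 1)

    def forward(self, x):
        h = self.body(x)
        return self.policy_head(h), torch.tanh(self.value_head(h))


net = TinyPolicyValueNet(n_features=17, n_moves=362)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Dummy batch standing in for self-play data: states, MCTS visit-count
# probabilities pi, and game outcomes z from the current player's perspective.
states = torch.randn(8, 17)
pi = torch.softmax(torch.randn(8, 362), dim=-1)
z = torch.randint(0, 2, (8, 1)).float() * 2 - 1  # +1 or -1

logits, v = net(states)
# Policy projection: cross-entropy between search probabilities and network policy.
policy_loss = -(pi * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
# Value projection: squared error between predicted value and game outcome.
value_loss = ((z - v) ** 2).mean()
loss = value_loss + policy_loss  # L2 regularization is handled by weight_decay above

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Each gradient step pulls the network toward the stronger search policy and the observed outcomes, which is exactly the "projection back into the function space of the neural network" the excerpt describes.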