Paper Review: Mastering the game of Go without human knowledge

This paper, published by DeepMind, describes how its Go program AlphaGo Zero evolved from the earlier versions of AlphaGo described in “Mastering the Game of Go with Deep Neural Networks and Tree Search” (2016). AlphaGo had already beaten strong human players: for example, AlphaGo Lee defeated Lee Sedol, the winner of 18 international titles.

There are several things that set AlphaGo Zero apart from its predecessors. First, its superhuman performance: without ever seeing a human game, it defeated the previous best AlphaGo 100-0 in a match while running on a single machine with 4 TPUs, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. Second, it is trained purely by self-play reinforcement learning: AlphaGo Zero was given only the rules of the game and then learned by playing matches against itself. Third, it uses a single neural network, rather than separate policy and value networks. What surprised me the most was that AlphaGo Zero surpassed its predecessors after playing about 5 million games, which corresponds to roughly 10^9 states, just a tiny fraction of the total number of possible states (about 10^170). Although most of the early games were played essentially at random, the agent learned very quickly: whereas AlphaGo Lee was trained over several months, AlphaGo Zero outperformed it after training for just 36 hours.
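To make the training setup concrete, here is a minimal Python sketch of the self-play loop as I understand it from the paper. The game, the search, and the network update are toy stand-ins; names such as `run_mcts` and `update_network` are my own placeholders, not the authors' implementation.

```python
# A minimal sketch of the self-play training loop: play games against
# yourself, record (state, search probabilities, outcome), then train.
import random

BOARD_MOVES = 361 + 1  # 19x19 intersections plus "pass", as in Go

def run_mcts(state):
    """Stand-in for MCTS: returns a probability over moves (uniform here)."""
    return [1.0 / BOARD_MOVES] * BOARD_MOVES

def play_one_game():
    """Self-play one game, recording (state, search probabilities) pairs."""
    history, state, move_count = [], "empty board", 0
    while move_count < 10:              # toy termination instead of real Go rules
        pi = run_mcts(state)            # search probabilities pi
        history.append((state, pi))
        move = random.choices(range(BOARD_MOVES), weights=pi)[0]
        state = f"state after move {move}"
        move_count += 1
    # Toy outcome; the paper signs z from the perspective of the player to move.
    z = random.choice([+1, -1])
    return [(s, pi, z) for (s, pi) in history]

def update_network(batch):
    """Stand-in for gradient descent on the combined policy/value loss."""
    print(f"training on {len(batch)} positions")

# The overall loop: generate games by self-play, then train on them.
replay = []
for _ in range(3):
    replay.extend(play_one_game())
update_network(replay)
```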

In AlphaGo Fan there is a policy network that takes as input a representation of the board and outputs a probability distribution over legal moves, and a separate value network that takes the same input and outputs a scalar predicting the expected outcome of the game if play continued from that position. AlphaGo Zero combines both of these roles into a single deep neural network that outputs both move probabilities and a predicted outcome value. The input to the network consists of 17 binary feature planes. At each move, a Monte Carlo Tree Search (MCTS) guided by the network is executed, and it outputs probabilities for each move. These search probabilities usually select much stronger moves than the raw move probabilities of the policy head on its own.
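To make this concrete, below is a much smaller PyTorch sketch of a combined policy/value network over 17 input planes, together with the way MCTS visit counts can be turned into move probabilities. The layer sizes and names are my own assumptions for illustration; the paper's actual network is a far deeper residual network.

```python
# A small sketch of a single network with a policy head and a value head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    def __init__(self, planes=17, filters=32, board=19):
        super().__init__()
        self.conv = nn.Conv2d(planes, filters, kernel_size=3, padding=1)
        # Policy head: a distribution over 19*19 moves plus "pass".
        self.policy_conv = nn.Conv2d(filters, 2, kernel_size=1)
        self.policy_fc = nn.Linear(2 * board * board, board * board + 1)
        # Value head: a scalar in [-1, 1] predicting the game outcome.
        self.value_conv = nn.Conv2d(filters, 1, kernel_size=1)
        self.value_fc1 = nn.Linear(board * board, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        h = F.relu(self.conv(x))
        p = F.relu(self.policy_conv(h)).flatten(1)
        p = F.log_softmax(self.policy_fc(p), dim=1)          # move log-probabilities
        v = F.relu(self.value_conv(h)).flatten(1)
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v))))
        return p, v

def search_probabilities(visit_counts, temperature=1.0):
    """MCTS play probabilities, proportional to visit counts N(s,a)^(1/T)."""
    counts = torch.tensor(visit_counts, dtype=torch.float) ** (1.0 / temperature)
    return counts / counts.sum()

# Forward pass on an empty-board tensor with 17 feature planes.
x = torch.zeros(1, 17, 19, 19)
log_p, v = PolicyValueNet()(x)
print(log_p.shape, v.shape)                  # torch.Size([1, 362]) torch.Size([1, 1])
print(search_probabilities([10, 5, 1]))      # more-visited moves get higher probability
```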

The paper has several strengths. First, it is very clearly written and well organized. Second, the authors compare AlphaGo Zero with its predecessors (such as AlphaGo Fan and AlphaGo Lee) along several axes, including performance, architecture, and the processing power each requires. This gives the reader a very clear idea of the differences and similarities among them.

As for weaknesses, I think the authors should have explained the reasons behind their design choices rather than simply stating them. For example, they used Monte Carlo Tree Search, while other options exist, such as minimax or alpha-beta search; a brief discussion of why MCTS was chosen would help naive readers like myself understand the paper a bit better.
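For illustration, here is a minimal sketch of alpha-beta pruning, the classical alternative mentioned above, run on a toy game. The helpers `moves`, `evaluate`, and `apply_move` are hypothetical stand-ins; the point is only that such a search relies on a hand-crafted evaluation function and wide expansion of the game tree, which is impractical for Go.

```python
# Minimal alpha-beta search over a caller-supplied game description.
def alphabeta(state, depth, alpha, beta, maximizing, moves, evaluate, apply_move):
    if depth == 0 or not moves(state):
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for m in moves(state):
            value = max(value, alphabeta(apply_move(state, m), depth - 1,
                                         alpha, beta, False, moves, evaluate, apply_move))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                    # prune: the opponent will never allow this line
        return value
    else:
        value = float("inf")
        for m in moves(state):
            value = min(value, alphabeta(apply_move(state, m), depth - 1,
                                         alpha, beta, True, moves, evaluate, apply_move))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# Toy usage: a "game" where the state is a number and each move adds 1 or 2.
print(alphabeta(0, 3, float("-inf"), float("inf"), True,
                moves=lambda s: [1, 2], evaluate=lambda s: s,
                apply_move=lambda s, m: s + m))
```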

Link to the paper: https://www.nature.com/articles/nature24270.pdf
