It is usual to distinguish between biological and machine intelligence, and for good reason: organisms have interacted with the world for millennia and survived; machines are a recent human construction, and until recently there was no reason to consider them capable of intelligent behaviour.
Computers changed the picture somewhat, but until very recently attempts at artificial intelligence proved disappointing. As computers and programs increased in power and speed, a defensive trope developed: a computer will never write a poem/enjoy strawberries/understand the wonder of the universe/play chess/have an original thought.
When IBM’s Deep Blue beat Kasparov there was a moment of silence. The best excuse that could be proffered was that chess was an artificial world in which reality was bounded and subject to rules. At that point, from a game-playing point of view, Go with its far greater complexity seemed an avenue of salvation for human pride. When AlphaGo beat Lee Sedol at Go, humans ran out of excuses. Not all of them: some were able to retaliate with “it’s only a game: real problems are fuzzier than that”.
Perhaps. Here is the paper. For those interested in the sex ratio at the forefront of technology: there are 17 authors, and I had assumed that one was a woman, but no, all 17 are men.
AlphaGo used supervised learning. It had some very clever teachers to help it along the way. AlphaGo Zero reinforced itself.
By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking.
AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree.
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.
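The recipe the authors describe (self-play games generated by search, with a single network then fitted both to the search's own move probabilities and to the final outcome) can be sketched in miniature. Everything below is a stand-in of my own: a toy one-pile Nim game in place of Go, a uniform-prior stub in place of the neural network, and no real tree search.

```python
import random

# Toy single-pile Nim stands in for Go: take 1 or 2 stones,
# whoever takes the last stone wins. Players are +1 and -1.
class NimState:
    def __init__(self, stones=7, player=1):
        self.stones, self.player = stones, player
    def legal_moves(self):
        return [m for m in (1, 2) if m <= self.stones]
    def play(self, m):
        return NimState(self.stones - m, -self.player)
    def terminal(self):
        return self.stones == 0
    def winner(self):
        # the player to move has no stones left, so the previous player won
        return -self.player

class UniformNet:
    """Stand-in for the policy-value network: uniform priors, zero value."""
    def predict(self, state):
        moves = state.legal_moves()
        return {m: 1.0 / len(moves) for m in moves}, 0.0

def search_policy(state, net):
    """Heavily simplified: AlphaGo Zero returns MCTS visit-count
    probabilities; here we just return the network priors directly."""
    priors, _value = net.predict(state)
    return priors

def self_play_game(net, start_stones=7):
    """One self-play game, recording (state, search probabilities) pairs,
    each labelled with the final outcome from that player's perspective."""
    state, history = NimState(start_stones), []
    while not state.terminal():
        pi = search_policy(state, net)
        history.append((state, pi))
        move = random.choices(list(pi), weights=list(pi.values()))[0]
        state = state.play(move)
    z = state.winner()
    # z * s.player is +1 if the player to move in s eventually won, else -1
    return [(s, pi, z * s.player) for s, pi in history]

# each record is one training example: position, search probabilities,
# and the eventual winner -- the only supervision signal used
data = self_play_game(UniformNet())
```

In the real system the third element of each record trains the value head and the second trains the policy head, closing the loop between search and network.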
How shall I describe the new approach? I can only say that it appears to be a highly stripped-down version of what had formerly (in AlphaGo Fan and AlphaGo Lee) seemed a logical division of computational and strategic labour. It cuts corners in an intelligent way, always looking for the best way forward, often accepting the upper confidence limit in a calculation. While training itself it also develops the capacity to look ahead at future moves. If you glance back at my explanation of what was going on in those two programs, the jump forward made by AlphaGo Zero will make more sense.
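The "upper confidence limit" remark refers to the selection rule inside the tree search: each candidate move is scored by its average value plus an exploration bonus that is large for little-visited, high-prior moves and shrinks as they are tried. A minimal sketch of such a PUCT-style rule follows; the constant and the example numbers are mine, not the paper's.

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximising Q + U, where Q is the mean value so far
    and U is an upper-confidence bonus proportional to the network prior
    and inversely related to the visit count."""
    total_visits = sum(n for _, n, _ in children.values())
    def score(stats):
        prior, visits, value_sum = stats
        q = value_sum / visits if visits else 0.0
        u = c_puct * prior * math.sqrt(total_visits) / (1 + visits)
        return q + u
    return max(children, key=lambda move: score(children[move]))

# children: move -> (prior from the network, visit count, summed value)
children = {
    "a": (0.6, 10, 4.0),   # well explored, decent average value
    "b": (0.3, 1, 0.9),    # barely explored, high value so far
    "c": (0.1, 0, 0.0),    # never visited
}
best = puct_select(children)   # "b": promising and under-explored
```

The rule explains the "cuts corners intelligently" impression: the search does not examine every line, it spends its simulations where optimism about the unknown is still justified.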
Training started from completely random behaviour and continued without human intervention for approximately three days. Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move.
Well, forget the three days that get all the headlines. This tabula rasa, self-teaching, deep-learning network played 4.9 million games. This is an effort of Gladwellian proportions. I take back anything nasty I may have said about practice makes perfect.
More realistically, no human player completes each move in 0.4 seconds, and even a lifetime spent at the board could not amass 4.9 million contests. One recalls Byron’s lament:
When one subtracts from life infancy (which is vegetation), sleep, eating and swilling, buttoning and unbuttoning – how much remains of downright existence? The summer of a dormouse.
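The dormouse arithmetic bears this out. A back-of-envelope check (the figure of 200 moves per game is my rough assumption; the other numbers are from the paper) shows both how long the games would take played serially and how much parallel hardware the three days conceal:

```python
SECONDS_PER_MOVE = 0.4             # from the paper
GAMES = 4_900_000                  # from the paper
MOVES_PER_GAME = 200               # rough assumption for a game of Go
TRAINING_SECONDS = 3 * 24 * 3600   # "approximately three days"

# total machine thinking time across all self-play games
total_move_seconds = GAMES * MOVES_PER_GAME * SECONDS_PER_MOVE

# played one move at a time, this is over a decade of nonstop play
sequential_years = total_move_seconds / (365 * 24 * 3600)

# degree of parallelism implied by finishing in three days
parallelism = total_move_seconds / TRAINING_SECONDS
```

On these assumptions the self-play amounts to roughly a dozen years of continuous machine-speed play, compressed into three days by something like fifteen-hundred-fold parallelism. The summer of a dormouse indeed.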
The authors continue:
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge.
AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.
Here are their website’s explanations of AlphaGo Zero.
The figures show how quickly Zero surpassed the previous benchmarks, and how it rates in Elo rankings against other players.
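For readers unused to Elo numbers: a rating gap converts to an expected score through a simple logistic formula, under which a 400-point advantage means roughly ten-to-one odds. A quick sketch, with ratings chosen for illustration rather than taken from the paper:

```python
def elo_expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model:
    every 400 rating points multiplies the odds by ten."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# a program rated 400 points above its opponent is expected
# to score about 0.91 per game (illustrative ratings)
p = elo_expected_score(3400, 3000)
```

This is why the large rating margins in the figures translate into near-certain victory game after game.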
The team concludes:
Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin. Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
This is an extraordinary achievement. The team have succeeded because they already understood how to build deep-learning networks. That is the key advance, one which is extremely complicated to understand and describe, but on which much can be built. As in the human case studied in 1897, at the dawn of empirical psychology, by Bryan and Harter in their work on the emerging technology of telegraphy, they have learned what to leave out. That is the joy of competence. Once telegraph operators understood the overall meaning of a message, the Morse codes of individual letters could almost be ignored. Key presses give way to a higher grammar, with a commensurate increase in the speed and power of communication. We leap forward by knowing what to skip. In their inspired simplification, this team have taken us a very big step forwards. Interestingly, the better the program, the lower the power consumption. Bright solutions require less raw brain power.
Is it “game over” for humans? Not entirely. Human players will learn from superhumans, and lift their game. It may lead to a virtuous circle, among those willing to learn. However, I think that humans may come to rely on superhumans as the testers of human ideas, and the detectors of large patterns in small things. It may be a historical inflection point. The National Health Service has already opened up its data stores to DeepMind teams to evaluate treatment outcomes in cancer. Many other areas are being studied by artificial-intelligence applications.
When I read their final conclusion, I feel both excitement and a sense of awe, as much for the insights of the past masters as for the triumph of the new iconoclasts of the game universe. The past masters could not adequately model the future consequences of their insights. Only now have the computing tools become available, though they were long anticipated. The authors are right to say, within their defined domains, that all this was achieved “in the space of a few days, starting tabula rasa”, but they would be the first to say, after Babbage, Turing, Shockley and all, that they stood on the shoulders of giants, and then erected new ladders to reach above humankind itself.