Preamble
This month's paper is titled "Accelerating Best Response Calculation in Large Extensive Games". It was written by Michael Johanson and Michael Bowling from the University of Alberta Computer Poker Research Group, Kevin Waugh from Carnegie Melon University, and Martin Zinkevich from Yahoo! Research. It was published in the proceedings of the Twenty-first International Joint Conference on Artificial Intelligence which was held in Barcelona, Spain in July 2011. The paper is available online at
http://poker.cs.ualberta.ca/publications/ijcai2011_accelerated_best_response.pdf.
The topic of this paper has already been discussed on the twoplustwo.com forums with participation by its lead author. Readers of this article may find that discussion to be worth reading.
This research effort addresses several interesting issues in the realm of research into computer strategy, but at its heart asks how exploitable are the current crop of near-optimal computer poker algorithms. That is, if the contestants in the Annual Computer Poker Competition were to play against an opponent who knew their strategy and opposed them with the best possible counter-strategy, how much would they be expected to lose? Such a counter-strategy is the same thing as the "nemesis" strategy discussed in Mathematics of Poker by Chen and Ankenman, and I'll use that term to describe it. Additionally, we may wonder whether the entrants in these contests (and poker algorithms in general) have become less exploitable over time.
Calculating a strategy's expected win or loss rate against its nemesis is very computationally expensive for a game with as many possibilities as real poker. The purpose of the paper as stated in the title is to come up with faster ways to evaluate the exploitability of real poker strategies.
Introduction
First off, to make sure we understand the terminology the authors are using, let us define a "strategy" as a complete set of all plays that a poker player may make in a game based on the circumstances in that game. This can be probabilistic, that is, the strategy might dictate that under a given set of circumstances, a player would raise some percentage of the time and call some percentage of the time.
If we develop new poker strategies, there are generally two ways it might be tested. The first is to acquire other poker strategies and play them against each other in something akin to a tournament and see which one wins. It's not always clear what the best set of ground rules should be to evaluate these algorithms, and it's not always clear what's the best way to interpret the results. There will always be some measure of luck involved. Further, it can easily be the case that while algorithm A beats B and B beats C, that C might beat A. That is, the strength of poker algorithms may not be transitive.
In a situation in which we don't know the optimal strategy for a game, a second method for comparing poker strategies would be to compute their performance against their nemesis opponent. The most straightforward way to evaluate the performance of an algorithm against a nemesis strategy is by a technique known as best response calculation. This notion is a fundamental contribution to game theory made by John Nash, and a pretty good explanation of this concept is available at Wikipedia. For real poker games, a straightforward calculation of this strategy is not feasible, so the research community tends to run tournaments to compare strategies.
One method that software developers often use to reduce complexity in making strategy decisions is what we may call the binning abstraction. Because real poker games are so complex, developers map the full game to a less complex game, solve that game, and then map these results back to the complete game. One of the most popular ways of doing this is to not consider each hand composition individually, but to treat hands that have similar compositions equivalently, thus creating a simplified version of the game that they can map back to the full game. I wrote about these sorts of abstractions in more detail in an article appearing in the 2009 issue of this publication.
Over the years, some of the more popular ways teams refine their poker strategies that have been entered into the Annual Computer Poker Competition (ACPC) include using a finer-grained binning abstraction and superior algorithms for approximating a GTO (game theory optimal) strategy. However, the authors of this paper note that in general increasing the bin count does not always lead to better strategies. Consequently, we might well wonder whether there is any danger that today's state-of-the-art algorithms might be degraded by using too many bins.
In this paper, the authors explain their new technique for computing best response calculations. Because it's faster than the old method, they call it "accelerated best response", and now the reader should understand at least half the title of the paper.
Conventional Best Response
The second section of the paper covers some definitions the authors use. These are critical to understanding what they're discussing, but for the sake of brevity, I'll skip over those.
The subsequent section of the paper explains how the conventional best response algorithm works. Start by examining Figure 1 in the paper. The first diagram displays a game tree for a sample game of one card poker. The paper doesn't say, but it appears that player 1 can check or bet and player 2 can bet/raise, call, or fold. If player 2 bets or raises, player 1 can call or fold. The triangles represent end-states of the game (which end in either a showdown or when one opponent folds). The squares represent the points at which players make their check/bet/call/fold decisions.
Since each player can't see her opponent's hole cards, and often even won't find out what they were even after the game is over, we call poker a "hidden information" game. The authors call one player's (restricted) view of the information available to them his "information set". This is represented in the figure by the P1 and P2 information set trees. Additionally, the public tree shows how the poker game looks to an outside observer.
From one player's perspective, one terminal node in its information set may be an aggregation of multiple nodes in the game tree. The list of probabilities that a hand would end up in each of the information set terminal nodes (each of which may represent multiple terminal nodes in the game tree) is a vector defined as its "reach probability." To determine the value of reaching any terminal node in the information set, one would first multiply each of the reach probabilities by the value of that outcome, and then add all these together.
The authors explain how the algorithm is implemented in the third and fourth paragraphs of this section. It's rather involved, but I can't think of anything to write that makes it easier. Don't be concerned if it takes a couple of passes to sink in.
Accelerated Best Response
In this section, the authors outline their method to reduce the computation time necessary to calculate the value of the game against the nemesis opponent. The paper outlines four methods that are used to speed up this calculation, but I'll focus on just one aspect of one of these methods here.
In an imperfect information game such as poker, one of the aspects that makes calculating the value of various strategies difficult is that the players don't know what cards their opponent holds. They have to consider all the possible holdings and select the option that works best against that player's possible range. This works both ways, though. In the same way that we don't know our opponent's cards, he doesn't know what we're holding. So, once we have determined our opponent's range, then any time this same situation arises, we can be sure that his best response will be the same, and we can reuse this calculation.
Using all the methods outlined in the paper, the authors are able to calculate how exploitable various poker strategies are much faster than via the conventional best response algorithm. This makes it possible to measure the exploitability of a poker strategy for real games where previously this was infeasible.
Results in the Texas Hold'em Domain
The authors measure exploitability in milli(big)blinds per hand. If we divide their numbers in Table 1 by 10, we translate them to big blinds per hundred hands, a metric familiar to online players. As is usual for the CPRG algorithms, they are focusing on heads-up limit hold 'em.
First we examine some trivial strategies. We can use these to make certain that the algorithm is implemented properly. It's trivial to calculate the value of playing a player employing the "always fold" strategy. Correctly, the algorithm pegs the value of the game at 75 BB/100h.
Now the authors compute the exploitability of the "always raise" strategy. The paper cites a posting in the twoplustwo.com Poker Theory forum that quotes the value of playing this opponent at 370 BB/100h, and, again, there is agreement between the two sources.
Now the authors turn to the algorithms that were entered in the 2010 ACPC. The exploitability of each entrant is expressed in Table 2. We start by looking at the University of Alberta's very own Hyperborean.IRO. It is exploitable to the tune of 13.5 BB/100h. Other entrants are more exploitable, one to the tune of 40 BB/100h. Despite the fact that Hyperborean.IRO is the least exploitable of these algorithms, it wasn't the winner in this competition! Not all entrants were designed to play optimally, and obviously some do a better job of exploiting their weaker opponents than the current incarnation of Hyperborean.
Note, of course, that despite the fact that Hyperborean.IRO strives to not be exploitable, we have little idea what strategy it would take to extract value from it. Nor do we know how an optimal player would perform against it. Also, take note that these rates do not include a rake. Add in a rake of 0.14 BB/hand for each player, and nobody could be +EV against the current Hyperborean.
The authors then explore the question as to whether more abstraction bins are always better. Figure 4 exhibits a monotonic decrease for each algorithm as the number of bins increases. At least for the algorithms examined in this paper it appears that more bins is always beneficial. Of course, there may be a limit at which this benefit breaks down.
We have already learned that some of the more exploitable algorithms performed better in the computer poker competition. The authors explore this by modifying the 2008 version of the CPRG program known as Polaris. They alter its programming to assume that the payoffs it receives are better in some circumstances than in a true zero-sum game. The authors call this "tilting" the game, which is obviously used in a different context than it is normally used by the poker community.
The results of this are displayed in Table 3. If Polaris assumes it receives extra value when it wins at showdown or when its opponent folds, it actually plays less exploitable. This is an interesting result. For some reason, this algorithm's attempt at optimal behavior is easier to exploit than a similar version that plays a littler looser and more aggressively.
One popular method for tuning a poker-playing program is to iteratively improve itself using a technique such as counterfactual regret minimization, which was discussed in a paper I covered in an earlier article. However, as we see from Figure 6, after some initial improvements successive iterations become more exploitable. The authors equate this to a form of statistical overfitting. It's certainly similar to the classical overfitting problem, but I'm not certain that exactly the same thing is going on.
In any case, it would be very difficult to notice this without being able to measure the exploitability of the program. I don't recall seeing any other paper bring up this issue. I expect this result to be an eye opener for several research teams using this technique.
Conclusion
This paper covered a great deal of ground. There are several interesting results that arise from it that are worth reiterating. First, the authors demonstrate a way to measure exploitability that is significantly faster than conventional methods. Second, it's clear that the best computer poker algorithms, as represented by entrants into the ACPC, are becoming less exploitable over time, but are still significantly beatable by a nemesis opponent if we ignore the rake. Third, increasing numbers of abstraction bins still seem to be beneficial for limit Texas hold 'em algorithms. Fourth, intentionally skewing select poker playing algorithms to make them looser and more aggressive may actually make them more difficult to exploit. Fifth, beyond a certain point successive iterations of poker algorithms using methods such as counterfactual regret minimization can make a refined strategy more exploitable.
That's a lot of progress for one paper, and these are very interesting results. I expect this one to be cited quite a bit by future poker algorithm researchers. I found it to be quite an eye-opener.
I'd like to thank one of the paper's co-authors, Michael Johanson, for pointing me at this paper and taking the time to make suggestions that greatly improved this article. Of course, any errors or omissions are solely my responsibility.


