## Wednesday, July 1, 2009

### Q-Learning

A Markov Decision Process MDP is defined by a set of states together with a set of allowed actions for each state a transition model which gives the transition probabilities between states when an action is taken and finally a reward model which gives a numerical reward for any (beginning state, action, end state) triple. A great many things can be described as MDP's Pacman for instance is an MDP. In fact just about every game is an MDP. Games with hidden information are POMDP's that is Partially Observable Markov Decision Processes.

If you want to win a game one way of doing it is treating the game as a MDP and then try to solve for the optimal value function over the states. The optimal value function on the states is the expected total cumulative value of all the rewards you can expect by acting optimally once in that state. Since a lot of games don't have an absolute stopping point beyond which you can't move generally you use a discounted reward model instead which basically means you prefer short term rewards to long term rewards to some extent. To figure out what to do in a MDP you can try to learn the transition model for the states and learn the reward model and from there you can try to solve the MDP to get the optimal value function.

Because trying to learn the transition model as well as the rewards is kind of a pain a popular alternative is to instead try to learn values over all pairs of actions and states, this is called Q learning. In Q-learning what you do is you keep track of what you think the value of any particular state is and then pick an action going from that state and update the value of the state action pair composed of the state you came from and the action you took based both on your current belief about the value of the state you transitioned into (which you can find by taking the maximum of the Q values of the new state) and any rewards you might have gotten along the way. Viola! you have a working means of picking actions in any state, you just look at the Q values and pick the action associated to the one with the highest Q value.

Sorry if this explanation is horrible (because it is) you should probably look it up on wikipedia or something or if you really care then google "sutton and barto reinforcement learning" and you can read a book about it online.

Now my beef with Q learning is this, in game theory you spend a bunch of time figuring out that pure strategies optimally solve only a very small fraction of games. I am talking here about two player zero sum games. Actually my beef is not so much with Q learning in general tasks but specifically in the case of two player zero sum games. Consider the game paper rock scissors. There are three strategies namely the strategy rock the strategy scissors and the strategy paper. A q learner set loose on this game would at first try something at random say paper and win because their opponent chose rock. Now the q-learner thinks that paper is the best strategy because the first time it tried it it won so next time around it tries it again and loses because their opponent chose the strategy scissors. The point is that someone else who knows that their opponent is a q-learner will know that the q-learner will tend to avoid whatever they just chose if they lost with it and tend to pick whatever they just won with over again. So q-learners fail at figuring out RPS. From a game theoretic standpoint what a deterministic q-learner is doing is assuming that every game has a saddle point (which is to say it has an optimal pure strategy). A popular alternative to the deterministic Q learner is a Q learner which chooses between different options by not choosing the "best" Q value but rather choosing from among the different actions using a boltzmann distribution with some temperature. The lower the temperature the closer the Q-learner is to deterministic but with any finite temperature a boltzmann (or softmax as they are actually called) Q learner will learn to play RPS correctly eventually. However the current wisdom is to use the temperature of the boltzmann Q learner as simply a means to allow for convergence and then limiting the temperature to 0 (meaning making the thing become deterministic). But from my perspective the temperature is clearly a feature. In fact I wonder if it might not be the case that picking a finite "equilibrium" temperature for each state would actually yield the correct optimum game theoretic mixed strategy. I don't really have any reason to believe that the distribution over the actions would converge to values in the appropriate way so that you could use the boltzmann distribution to get the correct mixed strategy but it is just so pretty that even if it isn't true it couldn't hurt to apply it and see how it goes.

Now that I have been thinking about it a bit more the pretty idea is sunk or at the very least incomplete. That is because the whole point of optimal mixed strategies is that the expected payoff of following each pure strategy is the same as all the others (assuming that the opponent is playing according to their optimal mixed strategy). So if the Q learner were in fact to come to learn the optimal values for each action/strategy it would learn that they were all the same. And yet we shouldn't allow it to choose between them with equal probability or we would not be acting optimally. I guess even trying to do something fancy like that the normal Q value system is just sunk as far as game theoretic actual optimality goes.

I guess what you would have to do is what my first thought for correcting this problem was. That is you solve a small matrix game for every state. This small matrix would consist of rows of all the actions you could take and then columns consisting of all the actions your opponent could take. You could then fill in the entries in the matrix by getting the values of the resulting states after those actions and then solve the matrix for the optimal strategy. Following the optimal strategy you can then do a one step update from your destination state.