Friday, July 3, 2009

approximate Q learning

So usually if a q learning problem is interesting it will have too many states to make it practical to keep track of the value of each individual state separately. So in approximate q learning you replace states with a set of features of the state and then you try to learn a value function on those features. So for instance in chess instead of keeping track of the value of each board configuration (since there are about 35^50 of them) you keep track of only a few features of the board configuration. You might for instance keep track of the total number of squares your pieces are attacking and the total number of squares your opponent is attacking. In general if your pieces are attacking a larger number of squares than your opponent then that denotes a board position that is good for you. In the case of normal linear approximation q learning you assume that the value function associated to a feature is a linear function through the origin. So say you find on one q learning update step that your reward and end feature state evaluation give you a value of 10 for a feature value of 5 then you would expect the value function for that feature to be a line with a slope of 2. Generally every time you make an observation you would nudge the slope of the function a bit towards the slope that you observe.

A linear function through the origin does not seem to me to be a particularly hopeful candidate for being able to accurately describe a utility function. In fact if I were to choose a family of one parameter functions to try and describe a utility functions I would probably go for exponential decay. (since sparsity is a good thing maybe this is in fact exactly what I should do) But for my wavelets end of term project last semester I decided that it might be more interesting to instead allow for totally general utility functions but try and make good guesses by assuming that they are sparse in a wavelet basis. With that assumption you could even use compressive sensing in order to make efficient use of your observations and then even if you are wrong about the sparsity of the utility function in the wavelet basis thats ok because you still have the ability to learn an arbitrarily good approximation of that function.

I actually implemented said learning algorithm and used it on the pacman problem but it kinda sucked. My wavelets professor thought that the work was cool enough that I might want to pursue publication of some sort. I have been talking to Hal my AI professor and he also seems to think it is potentially publishable. So possibly the aftermath of the pacman competition could also give me my first publication!

No comments: