Wednesday, July 1, 2009

Q-Learning

A Markov Decision Process (MDP) is defined by a set of states, a set of allowed actions for each state, a transition model which gives the transition probabilities between states when an action is taken, and finally a reward model which gives a numerical reward for any (beginning state, action, end state) triple. A great many things can be described as MDPs; Pacman, for instance, is an MDP. In fact just about every game is an MDP. Games with hidden information are POMDPs, that is, Partially Observable Markov Decision Processes.

If you want to win a game, one way of doing it is to treat the game as an MDP and then try to solve for the optimal value function over the states. The optimal value of a state is the expected cumulative total of all the rewards you can expect by acting optimally from that state onward. Since a lot of games don't have an absolute stopping point beyond which you can't move, you generally use a discounted reward model instead, which basically means you prefer short-term rewards to long-term rewards to some extent. To figure out what to do in an MDP you can try to learn the transition model and the reward model for the states, and from there you can try to solve the MDP to get the optimal value function.
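
The solve-the-MDP step can be made concrete with value iteration. Here is a minimal sketch on a made-up two-state MDP (the state names, actions, and numbers are my own illustration, not from any real game):

```python
import math

# Illustrative MDP: in "s0" you choose "safe" (reward 1, then done) or
# "risky" (50/50: reward 3 and done, or reward 0 and try again).
# "s1" is absorbing with no further reward.
MDP = {
    "s0": {
        "safe":  [(1.0, "s1", 1.0)],                    # (prob, next state, reward)
        "risky": [(0.5, "s1", 3.0), (0.5, "s0", 0.0)],
    },
    "s1": {"stay": [(1.0, "s1", 0.0)]},
}
GAMMA = 0.9

def value_iteration(mdp, gamma, sweeps=200):
    """Repeatedly apply the Bellman optimality backup until values settle."""
    v = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        v = {s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                    for outcomes in mdp[s].values())
             for s in mdp}
    return v
```

For this toy MDP the "risky" action wins: its value satisfies V = 1.5 + 0.45·V, i.e. V = 1.5/0.55 ≈ 2.73, which value iteration converges to geometrically.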

Because trying to learn the transition model as well as the rewards is kind of a pain, a popular alternative is to instead learn values over all pairs of states and actions; this is called Q-learning. In Q-learning you keep an estimated value Q(s, a) for every state-action pair. Each time you take an action a from a state s, you update Q(s, a) based both on any reward you got along the way and on your current belief about the value of the state you transitioned into (which you can find by taking the maximum of the Q values of the new state). Voila! You have a working means of picking actions in any state: just look at the Q values and pick the action with the highest one.
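
As a concrete sketch, here is tabular Q-learning with an epsilon-greedy policy on a tiny made-up deterministic MDP (the states, actions, and rewards are invented for illustration):

```python
import random

# Tiny illustrative MDP: in state "A" the action "go" pays reward 1 and
# leads to "B"; every other action pays 0 and stays put.
TRANSITIONS = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"):   ("B", 1.0),
    ("B", "stay"): ("B", 0.0),
}
ACTIONS = {"A": ["stay", "go"], "B": ["stay"]}

def q_learn(episodes=2000, steps=10, alpha=0.5, gamma=0.5, eps=0.2, seed=1):
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in TRANSITIONS}
    for _ in range(episodes):
        s = "A"
        for _ in range(steps):
            if rng.random() < eps:                       # explore
                a = rng.choice(ACTIONS[s])
            else:                                        # exploit
                a = max(ACTIONS[s], key=lambda x: q[(s, x)])
            s2, r = TRANSITIONS[(s, a)]
            best_next = max(q[(s2, x)] for x in ACTIONS[s2])
            # the one-step Q-learning update
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q
```

With gamma = 0.5 the learned values settle at Q(A, go) = 1, Q(A, stay) = 0.5, and Q(B, stay) = 0, so the greedy policy in "A" is "go", as it should be.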

Sorry if this explanation is horrible (it probably is); you should look it up on Wikipedia or something, or if you really care, google "sutton and barto reinforcement learning" and you can read a book about it online.

Now my beef with Q-learning is this: in game theory you spend a bunch of time figuring out that pure strategies optimally solve only a very small fraction of games. I am talking here about two-player zero-sum games; actually my beef is not so much with Q-learning on general tasks as specifically in the case of two-player zero-sum games. Consider the game rock paper scissors. There are three strategies, namely rock, scissors, and paper. A Q-learner set loose on this game would at first try something at random, say paper, and win because its opponent chose rock. Now the Q-learner thinks that paper is the best strategy, because the first time it tried it it won, so next time around it tries paper again and loses because its opponent chose scissors. The point is that someone who knows their opponent is a Q-learner will know that the Q-learner will tend to avoid whatever it just lost with and tend to pick again whatever it just won with. So Q-learners fail at figuring out RPS. From a game-theoretic standpoint, what a deterministic Q-learner is doing is assuming that every game has a saddle point (which is to say an optimal pure strategy). A popular alternative to the deterministic Q-learner is one which chooses among the different actions not by taking the "best" Q value but by sampling from a Boltzmann distribution over the Q values with some temperature. The lower the temperature, the closer the Q-learner is to deterministic, but with any finite temperature a Boltzmann (or softmax, as it is actually called) Q-learner will eventually learn to play RPS correctly. The current wisdom, however, is to use the temperature simply as a means to allow for convergence and then anneal it to 0 (meaning the thing becomes deterministic). From my perspective the temperature is clearly a feature.
In fact I wonder whether picking a finite "equilibrium" temperature for each state might actually yield the correct game-theoretic optimal mixed strategy. I don't really have any reason to believe that the distribution over the actions would converge in the appropriate way for the Boltzmann distribution to give the correct mixed strategy, but it is just so pretty that even if it isn't true it couldn't hurt to try it and see how it goes.
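
For concreteness, here is one way to sketch Boltzmann (softmax) action selection over a vector of Q values; the max-subtraction is just numerical hygiene, and the function name is my own:

```python
import math
import random

def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index with probability proportional to exp(Q/T)."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1          # guard against rounding at the edge
```

As the temperature shrinks this approaches the deterministic argmax; as it grows, the choice approaches uniform over the actions.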

Now that I have been thinking about it a bit more, the pretty idea is sunk, or at the very least incomplete. That is because the whole point of optimal mixed strategies is that the expected payoff of following each pure strategy in the mix is the same as all the others (assuming the opponent is playing according to their own optimal mixed strategy). So if the Q-learner were in fact to learn the optimal values for each action, it would learn that they were all the same. And yet we shouldn't have it choose between them with equal probability, because equal probabilities are not, in general games, the optimal mix. Even trying something fancy like that, the normal Q-value system is just sunk as far as game-theoretic optimality goes.

I guess what you would have to do is my first thought for correcting this problem: solve a small matrix game at every state. The matrix would have a row for each action you could take and a column for each action your opponent could take. You fill in the entries with the values of the states resulting from those action pairs, then solve the matrix game for the optimal mixed strategy. Following that strategy, you can then do a one-step update from your destination state.
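
The per-state matrix solve would normally be done exactly with a linear program. As a dependency-free sketch of the same goal, here is regret-matching self-play (a different but standard technique, not an exact LP solve): both players' average strategies are driven toward the minimax mixed strategy, which for RPS is the uniform mix.

```python
# Row player's payoffs for rock-paper-scissors (rows/cols: R, P, S).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def strategy(regrets):
    """Play in proportion to positive regret; uniform if there is none."""
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1.0 / len(regrets)] * len(regrets)

def regret_matching(payoff, iters=60000):
    n = len(payoff)
    # Tiny asymmetric seeds so play does not start exactly at equilibrium.
    reg = [[1e-3] + [0.0] * (n - 1), [0.0, 1e-3] + [0.0] * (n - 2)]
    avg = [[0.0] * n, [0.0] * n]
    for _ in range(iters):
        s0, s1 = strategy(reg[0]), strategy(reg[1])
        # expected utility of each pure action against the opponent's mix
        u0 = [sum(payoff[a][b] * s1[b] for b in range(n)) for a in range(n)]
        u1 = [sum(-payoff[a][b] * s0[a] for a in range(n)) for b in range(n)]
        ev0 = sum(p * u for p, u in zip(s0, u0))
        ev1 = sum(p * u for p, u in zip(s1, u1))
        for a in range(n):
            reg[0][a] += u0[a] - ev0
            reg[1][a] += u1[a] - ev1
            avg[0][a] += s0[a]
            avg[1][a] += s1[a]
    return [x / iters for x in avg[0]]     # row player's average strategy
```

The per-round play oscillates, but the time-averaged strategy is what converges toward the equilibrium mix, which is exactly the distinction a deterministic Q-learner misses.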

Artificial Intelligence and Brain Damage

Let's suppose for a moment that in the next decade or so we achieve a series of breakthroughs in nanotechnology and AI that allow the creation of essentially perfect duplications of someone's personality as an AI construct. I would be willing to say that if a person received massive brain damage in this era, it would make sense to equip them with something of an artificial brain. Remember that I said the reproduction of their personality would be exact, so all of their characteristics would remain intact when this change took place. I would be an advocate of this sort of procedure, and actually of essentially the same procedure even if it was death instead of brain damage that had occurred, ignoring for the moment the moral implications of the resulting overpopulation and possible stagnation problems.

But what about people (like my mother) who experienced debilitating brain damage decades ago and have since evolved into someone very different from their original self, although still sharing similarities? What if you could go in and "correct" their brain, restoring their old capacities and old personality while letting them retain their accumulated memories of the intervening time? That would be unfair to the person living now, but leaving things as they are is perhaps unfair to the person as they were. The problem is of course not really much of a problem, because the moral dilemma only has one solution: the person who exists now can choose what they want. I guess I'm just babbling. Isn't the internet wonderful? I can post crap like this and nobody cares! wait... nobody cares... /sob

Compressive Sensing

The idea behind compressive sensing is that any interesting signal is not going to look like noise. If you want to reconstruct the signal you probably don't need to sample all the data in the signal, just enough to capture the information in it. Say for instance you know the signal is perfectly linear: you can just sample two points and be done with it. In essence, the idea behind compressive sensing is that real signals tend to be approximately sparse in some convenient basis. Being approximately sparse in a wavelet basis, for example, means that most of the coefficients of the wavelet transform are either zero or close to zero. The compressive sensing community uses the terminology that a vector is k-sparse if it has k nonzero entries, and approximately k-sparse if all but k entries are close to zero. Now, the oh-so-nifty thing about compressive sensing is that you only need something like k*log(N/k) measurements to get the data in the signal, where N is the number of coefficients in the signal. What you do is take your measurements at random, use those measurements as constraints in a linear program whose variables are the coefficients of the signal you want to recreate, and then minimize the sum of the absolute values of those coefficients (the L1 norm).

Coding up something like that, however, would be a bit complicated. At some point it would be great to program something capable of doing the simplex algorithm for linear programming, but for the moment it is a project I am not willing to undertake. I wanted to use compressive sensing for my wavelets project at the end of the semester, but I ended up not doing it because I couldn't code it (at least in time) and I didn't understand it all that well. What I ended up using was an algorithm that I just made up. It was relatively easy to code (compared to a general LP solver), and that is essentially why I used it. The algorithm takes a set of direct measurements of a signal and makes a bad guess for what the signal ought to look like. It takes the wavelet transform of this bad guess, and also the wavelet transform of an indicator vector with 1's where the measurements were made; this keeps track of where the wavelet coefficients contain direct information about the signal and where they don't. Starting at the lowest level of the wavelet transform, we look at the transform of the indicator vector, and wherever it has a nonzero coefficient we do a reverse wavelet transform ignoring the detail coefficients and put the resulting value in that place in the signal. We continue this for every level, all the way up, until the last step is simply putting the observed measurements in place.

The algorithm as I just described it is pretty sucky, slow and inefficient, but it does give pretty good estimates of the signal. An interesting thing to note, however, is that iterating the algorithm with the same observations, feeding the output back in as the "bad guess" for the next round, yields better results. I haven't really tested how much better, but the iterated version is totally a compressive sensing algorithm of sorts. Intuitively it makes perfect sense: since we throw away detail coefficients on every iteration, we should get closer and closer to a vector that is sparse in the wavelet basis. I haven't done a proper analysis, so I don't know if that is really true. I certainly am not willing to claim that the iteration converges to the L1 minimization in general, but it does seem to in at least some simple cases.
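
A toy version of the loop is easy to write down. This is a sketch of the same impose/transform/threshold/invert idea, not my actual project code: a plain Haar transform plus an iterated reconstruction that alternates between re-imposing the known samples and zeroing out small detail coefficients. The threshold and iteration count are arbitrary choices.

```python
def haar(x):
    """Full Haar decomposition: detail levels, finest first, then the overall average."""
    levels, cur = [], list(x)
    while len(cur) > 1:
        levels.append([(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)])
        cur = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
    levels.append(cur)
    return levels

def ihaar(levels):
    """Invert haar()."""
    cur = list(levels[-1])
    for details in reversed(levels[:-1]):
        nxt = []
        for a, d in zip(cur, details):
            nxt.extend([a + d, a - d])
        cur = nxt
    return cur

def reconstruct(samples, n, iters=20, thresh=1.0):
    """samples: dict index -> observed value; n: signal length (a power of 2)."""
    mean = sum(samples.values()) / len(samples)
    x = [samples.get(i, mean) for i in range(n)]
    for _ in range(iters):
        for i, v in samples.items():        # re-impose what we actually measured
            x[i] = v
        levels = haar(x)
        # zero out small detail coefficients, pushing toward wavelet sparsity
        levels = [[d if abs(d) > thresh else 0.0 for d in lvl]
                  for lvl in levels[:-1]] + [levels[-1]]
        x = ihaar(levels)
    for i, v in samples.items():
        x[i] = v
    return x
```

On a small piecewise-constant test signal this already beats the naive fill-with-the-mean guess, though as in the post, nothing here is a proof that it tracks the L1 solution.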

Tuesday, June 30, 2009

The NIF is online!

Sad to say, it took me this long to find out that the National Ignition Facility came online in March this year!!!

http://www.wired.com/science/discoveries/news/2009/05/gallery_nif?currentPage=1

The pictures are amazingly cool looking, and they are expected to finally be able to get more energy out than they put in with this one!!! In case you are wondering, I checked on ITER too, and they are not expecting to switch on until 2018. Makes you wonder if ICF (inertial confinement fusion) isn't the future of fusion after all.

blowing up Jupiter again

Because it has been so very long since the last time I discussed the blowing up of Jupiter on this blog, I figure it is time again. Furthermore, I really think it would be very healthy for me to try to make a blog post every day. Up to this point I have always been serious about the means one might employ to blow up poor Jupiter, but now I think it is time to set the constraint of seriousness aside. I am going to start posting methods of blowing up Jupiter that just don't make any practical sense. For instance, we could try to turn Jupiter into a black hole by shooting a single extremely energetic proton into its atmosphere. If the proton had an energy of, say, 10^45 J, then the first particle it hit would cause a collapse into a positively minuscule black hole hurtling toward the planet, giving off insanely high-energy Hawking radiation in every direction. This black hole would probably evaporate, but the high-energy radiation coming out of it would cause a huge pressure wave, which would itself spawn multiple black holes, and the whole planet would become a huge explosion. Unfortunately, the formation of all these tiny black holes would probably ensure that instead of collapsing into a black hole afterwards, the planet would just explode and completely destroy the solar system, and very probably any other system within, say, a few hundred light years. Jupiter is a loss then, but the shockwave could very possibly cause the collapse of hundreds or thousands of stars into black holes. Admittedly this doesn't seem very appealing, but it is just such a pretty picture: sending one particle into a planet to make an apocalyptic explosion.

The weight of light continued

The previous weight of light blog piece was inspired by thinking about how big the mass of the cosmic background radiation is. As per my previous post, an individual photon does not have mass, but a collection of photons generally will. At any rate, since the CMB is isotropically distributed, it has net zero linear momentum, so we can find its rest mass by simple application of E = mc^2. The CMB is basically a perfect blackbody, so we can use the relation u = 8*pi^5*t^4/(15*h^3*c^3), where t is the temperature expressed in joules. Since the CMB has a temperature of about 2.7 kelvin, that translates to t = 3.7 x 10^-23 J, which yields an energy density of about 4 x 10^-14 J/m^3. A quick google search gives a radius for the universe of roughly 78 billion light years. A light year is 9.4605284 x 10^15 meters, so we get a volume of 4/3*pi*r^3 = 4/3*pi*(78000000000 * 9.4605 x 10^15)^3 = 1.7 x 10^81 m^3. Multiplying the two gives the total energy content of the CMB, about 6.8 x 10^67 J. Translating this into kilograms via m = E/c^2 gives m = 7.5 x 10^50 kg.

The mass density of visible matter in the universe is estimated at 3 x 10^-28 kg/m^3, so using the above value for the volume of the visible universe, the total mass of visible matter comes to about 5 x 10^53 kg. So the cosmic background radiation weighs in at about a thousandth of the mass of the visible matter in the universe, a small but not insignificant contribution to the overall mass. Anyway, I thought it would be fun to post t3h calculation.
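
The arithmetic is easy to check in a few lines. This just redoes the post's numbers (78-billion-light-year radius, T = 2.7 K) with CODATA constants:

```python
import math

H = 6.62607015e-34    # Planck constant, J s
C = 2.99792458e8      # speed of light, m/s
KB = 1.380649e-23     # Boltzmann constant, J/K
LY = 9.4605284e15     # meters per light year

def cmb_mass(temp_k=2.7, radius_ly=78e9):
    t = KB * temp_k                                    # temperature in joules
    u = 8 * math.pi**5 * t**4 / (15 * H**3 * C**3)     # blackbody energy density, J/m^3
    volume = 4 / 3 * math.pi * (radius_ly * LY)**3     # volume of the sphere, m^3
    return u, u * volume / C**2                        # (density, equivalent mass in kg)
```

Running it gives an energy density around 4 x 10^-14 J/m^3 and a mass around 7.5 x 10^50 kg, roughly a thousandth of the visible matter.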

We Won!!!!

I can't believe I forgot to put up a post about this. We won the Pacman competition! Our team came out on top over Berkeley, and I didn't have to take the final in AI. I'd write a more detailed story but it's late and I am going to bed.

Monday, June 29, 2009

The weight of light

It is a fairly commonly known fact that the photon has no mass. While this is true for a single photon, it is not necessarily true for collections of photons. Let's take a step back for a moment and say what we mean by "rest mass". In both Newtonian dynamics and relativity, the energy of a free object is a function of its mass and velocity. In Newtonian dynamics the zero of the energy scale is arbitrary and only changes in energy have meaning. In other words, it doesn't matter whether you say a baseball going 2 m/s has an energy of 4 joules or -10 joules; all that matters is that when the ball goes from 2 m/s to 4 m/s, its energy goes from 4 to 16, or from -10 to 2. Once a zero was picked for the energy scale you could talk about changes in energy, and that was all that mattered. Relativity changed all that, because in relativity energy has a more absolute meaning. I don't remember the derivation, but the end of the story is that the equation E^2 = (pc)^2 + (mc^2)^2 holds, where p is momentum, m is mass, E is energy, and c, it goes without saying, is the speed of light. Solving this equation for mass you get m = c^-2 * sqrt(E^2 - (pc)^2). Just like any other particle, light obeys the relation p = h/L, where L is its wavelength. Also E = hf and f = 1/T = c/L, so
E = hc/L. Plugging this into the first equation we get m = c^-2 * sqrt((hc/L)^2 - (h/L * c)^2) = 0, and all is well. So you might think that a system of 2 photons would also be massless: 0 + 0 = 0, right? Well... maybe and maybe not. A single photon has no mass because it just so happens that for a photon E = pc, so E^2 - (pc)^2 = 0. But when you consider a system of 2 photons, while its energy is just the sum of the energies of the two particles, its momentum is the vector sum of the two momenta. So if you have two photons of the same wavelength moving exactly opposite to each other, the system has 0 momentum. Now you find that the system of 2 photons does indeed have mass!! m = 2h/(cL).
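
The two-photon bookkeeping is easy to sketch: add the energies, vector-sum the momenta, and apply m = sqrt(E^2 - (pc)^2)/c^2. The function below is my own illustration:

```python
import math

H = 6.62607015e-34   # Planck's constant, J s
C = 2.99792458e8     # speed of light, m/s

def invariant_mass(photons):
    """photons: list of (wavelength in meters, unit direction 3-vector)."""
    energy = sum(H * C / lam for lam, _ in photons)
    px = sum(H / lam * d[0] for lam, d in photons)
    py = sum(H / lam * d[1] for lam, d in photons)
    pz = sum(H / lam * d[2] for lam, d in photons)
    p_sq = px * px + py * py + pz * pz
    # m = sqrt(E^2 - (pc)^2) / c^2, clamped against floating-point dust
    return math.sqrt(max(energy * energy - p_sq * C * C, 0.0)) / C**2
```

Two parallel photons give zero mass; two antiparallel ones of wavelength L give 2h/(cL), exactly as in the text.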

So here is the big idea. I want to know what happens when two photons traveling perfectly parallel to one another, one in front of the other by a little distance, hit a mirror. At first they are traveling together, so their momenta add and we have a system with 0 mass. But after the first photon hits the mirror and gets reflected, we have a system with mass, because now their momenta cancel. This is where an awesome experimental setup comes into play. LIGO is built to be an incredibly sensitive instrument for the direct detection of gravity waves, by way of a gigantic, perfectly tuned Michelson interferometer. You watch the fringe on the interferometer and look for changes. Theoretically, if there are no gravity waves there shouldn't be any changes to the fringe, but of course because the thing is so damn sensitive it actually picks up vibrations from people walking around a mile away and things like that. Fortunately, the changes in the fringe from local vibration and from a gravity wave look different. Now, the thing is that we still haven't detected a gravity wave with LIGO. So here is the deal: why not try to make a gravity wave right smack on top of the damn thing? The idea is to shove an extra little laser pulse on top of the continuous beam already circulating in the interferometer, adding a little bit to it every time the pulse passes. The pulse would have to be extremely brief and extremely powerful in order to make a measurable g-wave. I don't really have the necessary knowledge to know exactly how powerful it would have to be, but as the pulse began to hit a mirror, its mass would increase to a peak when half the pulse had been reflected and then go back to zero as the rest of it got reflected. The mass generated by the pulse, even at its peak, would be absolutely minute, but it would go from 0 to full mass in the tiniest fraction of a second.
I don't know if that violence would be enough to make a measurable gravity wave either. The point is, it is a totally freaking awesome idea: use the device built to detect gravity waves to make them!!!!

To put things in perspective, the gravity wave would be extremely, extremely weak. Even assuming we could get the total energy in the burst to be somewhere on the order of a petajoule, the mass generated by the pulse at its peak would only be about ten grams (m = E/c^2 = 10^15 J / (9 x 10^16 m^2/s^2) ≈ 0.01 kg). So a bounce event at one of the mirrors would be something like a coin's worth of mass appearing and vanishing in the tiniest fraction of a second. And on top of the tiny mass, we are faced with an additional 11 orders of magnitude or so coming from the gravitational constant. So even a petajoule burst would probably not be enough.

Actually, I just realized that I've been ignoring the fact that the momentum from the light being reflected goes into the mirror, so when you consider the mirror and photons as a system, the total rest mass is unchanged. So the whole idea is dead in the water. Still, it was an interesting thought.

Wednesday, February 4, 2009

Rock Paper Scissors Lizard Spock

I am quite the fan of rock paper scissors. Usually the attraction of playing rock paper scissors is that the outcome is viewed as random, where random is taken to mean that all outcomes have equal probability. Because there is no obvious advantage to choosing, say, rock over scissors, we assume that the choice between these elements is made at random, in which case the game really does provide a random decision process. But suppose that the entities playing the game are not capable of truly random behavior (e.g. they are human). In that case there is now an incentive to try to predict what your opponent's action is going to be, especially if you are playing a game of RPS where the victor is decided by the best two out of three or the best three out of five, because then you have the data of what happened on the previous plays to give you an extra boost in deciding what to do now. I once heard of a computer program written to analyze the probability distribution of the behavior of human opponents in a game of RPS; after a little while of observing the human player, the computer opponent was able to beat the human somewhere around 60-70% of the time.

I have been thinking rather a lot lately about different methods by which it might be profitable to do behavior prediction of agents playing games. The rock paper scissors type of game seems to be a rather simple and interesting testing ground on which I could try out my methods of strategy analysis. Today I stumbled upon a game called Rock Paper Scissors Lizard Spock, which is a simple generalization of the idea of an RPS-type game:

http://www.samkass.com/theories/RPSSL.html

In RPS we are using a directed version of the complete graph on 3 elements, K3, which is just a triangle. The win-loss relationships in RPSSL are more complicated because we are using K5 instead, which is more than just a loop. We can make a further generalization of this type of game by taking the complete graph on n nodes, Kn, and randomly choosing a direction for each edge, denoting which node wins in a contest between the two (such a directed complete graph is called a tournament). While admittedly this doesn't make for very interesting humanly playable games, it does provide a convenient way to generate a large field of games for testing ideas about strategy prediction.
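
Generating such a game takes only a few lines: orient each edge of Kn at random. The function name and seed here are my own choices for illustration:

```python
import random

def random_tournament(n, seed=0):
    """Return beats[i] = set of nodes that node i defeats; one winner per pair."""
    rng = random.Random(seed)
    beats = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.5:     # coin-flip the direction of each edge
                beats[i].add(j)
            else:
                beats[j].add(i)
    return beats
```

With n = 3 this gives an RPS-like game whenever the orientation forms a cycle, and n = 5 gives a random RPSSL-like game.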

The method used in the aforementioned RPS-playing computer program was Markov chaining, which basically means the program would look at the last couple of plays of the human player and take them to be the indicator of what they would do next. The program estimates the human player's transition probabilities and then does what the probabilities suggest it should. For a human playing RPS this works well, partly because the state space for RPS is small, meaning that the previous play combinations have only a relatively small total number of possibilities. There are 9 possibilities of play at each step in the game, so if we want decent probabilities based on a one-step Markov chain, we only need to observe behavior for perhaps a couple dozen turns. Of course a branching factor of 9 is still fairly large, and no human player is going to play the computer long enough to make the depth-5 probabilities accurate, with 9^5 = 59,049 states. But I think the depth-1 probabilities would give decent performance and depth-2 better, and even depth-3 probabilities, with 9^3 = 729 states, are certainly within reach for long games. Since I am interested primarily in being able to beat other computers and not other humans, though, even depth-8 or depth-9 Markov chains are not out of reach computationally. I would be willing to hypothesize that computers playing against other computers with the same sort of Markov models would generally tend to end up tying each other. I think it would be an interesting exercise to code up such generalized RPS games and pit different prediction strategies against one another. I'll post if it ever gets done.
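
A depth-1 version of that predictor is only a few lines. This sketch is my own illustration (not the program mentioned above): it counts the opponent's move-to-move transitions and plays the counter to the most likely next move.

```python
import random
from collections import defaultdict

BEATEN_BY = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class MarkovPredictor:
    """Order-1 Markov model of an opponent's RPS moves."""

    def __init__(self, rng=random):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last = None
        self.rng = rng

    def move(self):
        # With no history for the current context, fall back to random play.
        if self.last is None or not self.counts[self.last]:
            return self.rng.choice(sorted(BEATEN_BY))
        follow = self.counts[self.last]
        predicted = max(sorted(follow), key=follow.get)
        return BEATEN_BY[predicted]     # counter the predicted move

    def observe(self, opponent_move):
        if self.last is not None:
            self.counts[self.last][opponent_move] += 1
        self.last = opponent_move
```

Against a fully patterned opponent, say one that cycles rock, paper, scissors forever, the predictor locks on after a few observations and wins every round thereafter, which is exactly the weakness in human play that the program exploited.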

Monday, February 2, 2009

Pacman competition

This semester I am taking an upper-division computer science course on artificial intelligence, more or less for fun. The CS department tries to only let full CS majors into the upper-division classes, so I had to go through a few people in order to actually get into the class. So far the class has been fairly interesting and quite a bit of fun. The first two programming projects in the class are centered around code for a Pacman game. This code (or more likely a modification of it) will form the basis for a competition at the end of the semester. The most exciting part is that the competition will have two parts: an intra-university competition here at the U, and an inter-university competition between the U and an AI class being run at Berkeley. I was planning on taking this course before I knew about the competition, but the competition has elevated the class to very near top priority among my classes.

Although the actual complete details of the competition have not yet been revealed the first two programming projects have been working with Pacman code and I have been thinking about some Pacman related problems as a result. Therefore I think I am going to begin posting about Pacman code and Pacman related problems. For now just know that my Pac-agents are destined for victory!!!!!

-P.S. Given a set of integer coordinates, and allowing only movements on a grid, what is the most efficient algorithm for calculating the minimum path distance that visits all the points in the given set?