I borrowed my girlfriend's tiny little laptop to take with me on my trip to Idaho and back for my family reunion. I stayed a few days longer than everybody else and helped my parents with some things they needed done. Just before I was about to get on the bus back to Salt Lake, my mother asked if she could try typing on this itsy-bitsy 8.9-inch laptop. Since I already had a blog post window open so that I could write an entry on my way back, I just let her use it. This is what she typed.

Dear Tim,

I love you very much! I hope you have a wonderful time on your birthday!!!!!!!!!!!!!!!!

Love from all

## Wednesday, July 29, 2009

## Monday, July 27, 2009

### Topics

I have been trying to think of some good blog posts over the past few days and have begun writing a whole bunch of different ones. I keep thinking that either I am not doing a good enough job describing a good topic or I don't really have anything worth saying. But I have finally decided on a blog post topic: namely, all the weird topics I have been trying to write up.

Artificial intelligence and Tetris

Neural-computer interfaces

The plausibility of visiting distant black holes

The evil Sheriff Arpaio

Gravity and the multiverse (dark energy)

The importance of science fiction

My family reunion

Genetic sequencing programs

False discovery rate statistics

The problems and benefits of short-term economic models

Finite symbol combination systems

Random wavelet coordinates

Some of these topics will definitely get put up, but others will no doubt fall by the wayside. I just thought it might be enlightening to know that I often start 4 or 5 blog posts for every 1 I publish.

## Wednesday, July 22, 2009

### Musical Tesla Coils

The Tesla coils are actually making the sounds. This isn't just a light show synchronized to music. The Tesla coils are producing the music!!!!

It is done by modulating how quickly you switch the coils on and off, timing the arcs so that they produce the right frequencies to play the music.
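As a rough numerical sketch of the idea (the numbers are made up for illustration, not taken from any real coil): if you toggle the drive on and off at an audio rate, the dominant frequency component of the drive signal, and hence of the arc, sits at the switching rate.

```python
import numpy as np

fs = 44_100                         # sample rate for the analysis (Hz)
f_switch = 440                      # hypothetical switching rate (Hz), i.e. concert A
t = np.arange(fs) / fs              # one second of time samples
drive = np.floor(2 * f_switch * t) % 2   # on/off drive signal toggled at 440 Hz

# The spectrum of the on/off drive peaks at the switching rate, so the arc
# radiates a tone at whatever rate you toggle the coil.
spectrum = np.abs(np.fft.rfft(drive - drive.mean()))
peak_hz = np.argmax(spectrum) * fs / len(drive)
print(peak_hz)  # 440.0
```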

## Tuesday, July 21, 2009

### Girl Genius

The webcomic Girl Genius is by far the best webcomic I have ever seen. My girlfriend linked it to me some time ago.

THE MOST BRILLIANT WEBCOMIC EVER!!!

I went through the series within a day or two and became current with the story very quickly. I just wanted to share the brilliance. Read the series and enjoy.

## Monday, July 20, 2009

### Metrics on the Reals

It has been something of a prime activity of my recent life to try to make a coordinate system for non-integer-dimensional spaces work. Although I have tried rather a lot of different approaches, I have never really come up with anything satisfactory. For instance, at one point I considered a sort of pseudo-Euclidean quotient space which had the appropriate scaling law for the "volumes" of spheres by radius. However, the space was less well behaved for concave sets. In fact, the "volume" of a subset of a concave set could actually be greater than the "volume" of the whole, which at the very least means that the volume scaling law doesn't hold uniformly for all shapes in that space. That was pretty much the kiss of death for the idea.

Other ideas have had a much longer and less clear history. For instance, I have been mucking about with random coordinate systems and fuzzy logic. How random coordinate systems or fuzzy logic might be used in a concrete way to build a coordinate system I don't know, which of course is why I still give them so much thought. One idea using random coordinate systems is to assume an infinite set of random vectors whose dot products follow a particular probability distribution. Assuming the vectors are uniformly distributed over the surface of the appropriate-dimensional sphere gives a very specific expected dot-product distribution for each dimension. If we assume a distribution somewhere in between, say, the 2- and 3-dimensional distributions, then perhaps we would have a consistent coordinate system, albeit one with a necessarily infinite number of coordinates. This idea, while pretty, is something I have not really gotten very far with. I really should put some sweat into it and see if I can make it work.
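The premise checks out numerically. A quick sketch with numpy (the 1/d law below is the standard result for uniformly random unit vectors): the expected squared dot product of two random unit vectors in d dimensions is 1/d, so the dot-product statistics really do encode the dimension, and one could imagine asking for statistics "between" d = 2 and d = 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_sq_dot(d, n=20_000):
    """Mean squared dot product of n random pairs of unit vectors in d dimensions."""
    u = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # normalizing Gaussians gives
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform points on the sphere
    return float(np.mean(np.sum(u * v, axis=1) ** 2))

# For uniform unit vectors in d dimensions, E[(u.v)^2] = 1/d.
print(mean_sq_dot(2), mean_sq_dot(3))   # ~0.5 and ~0.333
```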

All of this is not really the point I was trying to make, though (perhaps I should just rename this post and skip what I was trying to say). What I was thinking about recently is the fact that, because the cardinality of the real numbers is the same as the cardinality of any Euclidean space of any dimension (at least integer-dimensional ones, and presumably non-integer-dimensional ones too), you can find a bijective mapping from a space of any dimension onto the interval (0,1).

In other words, as long as the point set of a non-integer-dimensional space has the same cardinality as the integer-dimensional ones, then lurking somewhere in the set of functions on the interval between 0 and 1 is the metric for any dimensional space you care to think of.

For this reason I have been thinking that perhaps the best way to think about non-integer-dimensional spaces is to think about real number theory: the kind of stuff where you talk about recursive function mappings of the real numbers, for instance the mapping we use as the basis of the decimal system. The decimal system can be thought of as the output of an algorithm which maps the interval from zero to one to itself. Say you want a decimal representation of a number: you subtract out its integer part, multiply by 10, take the integer part of that as the next digit, and then rinse and repeat with the remainder.

Perhaps by considering mappings of the unit square to itself we might come up with a suitable metric.
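Both ideas fit in a few lines. The digit-extraction loop below is the algorithm just described, and the digit-interleaving map is the classic sketch of a bijection-like map from the unit square to the interval (ignoring the well-known trailing-nines caveat):

```python
def decimal_digits(x, n=8):
    """First n decimal digits of x in (0,1), via the map described above."""
    digits = []
    for _ in range(n):
        x = 10 * x        # shift the next digit into the integer part
        d = int(x)        # read it off...
        digits.append(d)
        x -= d            # ...and keep only the remainder
    return digits

def square_to_interval(x, y, n=8):
    """Interleave the digits of x and y: a sketch of a map (0,1)^2 -> (0,1)."""
    dx, dy = decimal_digits(x, n), decimal_digits(y, n)
    out = 0.0
    for i in range(n):
        out += dx[i] / 10 ** (2 * i + 1) + dy[i] / 10 ** (2 * i + 2)
    return out

print(decimal_digits(0.3125))           # [3, 1, 2, 5, 0, 0, 0, 0]
print(square_to_interval(0.31, 0.25))   # ~0.3215 (digits 3,1 and 2,5 interleaved)
```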

## Wednesday, July 15, 2009

### How much my blog is earning me

Interestingly, I don't know how to get a good count of how many people access my blog by any means other than the ads account. My ads account will tell me how many people have visited my blog, but my blog won't... odd. At any rate, the ads account apparently estimates how much money the publisher of an ad will make per thousand impressions. My site has an impressive $0.05 estimated eCPM. Since my blog has had an impressive 42 page impressions so far this month (how that happened I shall never know), I am proud to announce that my page is making an estimated 5.8 microbucks per hour... yup... somehow I think the amount it costs to host my page is rather higher than the payout to the advertisers. There must be some sort of power law relating the amount of money a blog makes Google to the number of such blogs. There must be millions of blogs that actually cost Google money and then like 0.01% of them that make money. Since hosting blogs is so much less bandwidth- and storage-intensive than video, it suddenly makes sense that YouTube is losing hundreds of millions of dollars a year despite its vast popularity.
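For what it's worth, the arithmetic behind the 5.8 figure checks out if roughly half the month has elapsed (the post is dated July 15):

```python
impressions = 42          # page impressions so far this month
ecpm = 0.05               # estimated dollars per thousand impressions
hours_elapsed = 15 * 24   # roughly half of July so far

earnings = impressions / 1000 * ecpm          # ~$0.0021 so far this month
microbucks_per_hour = earnings / hours_elapsed * 1e6
print(round(microbucks_per_hour, 1))          # 5.8
```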

### Peak Posting

As is true for most things, there is a point where the quality and quantity of blog posts combine to make for an optimum flow of readers. In general you need a volume of posts in order to draw readers, but you need quality posts in order to make them come back. Since my posts tend to have neither quality nor quantity, I suppose this question is (like most everything else on this blog) purely academic.

The simplest model I can think up is this: you have a quality index q between 0 and 1, which represents the likelihood that someone who stumbles upon the blog returns in the future, and a quantity index p between 0 and infinity, which represents how many posts you make per unit time and should roughly determine how many new readers you pull in with those posts. I would say the number of readers you pull probably varies something like log(p+1). Let us furthermore assume that a person has a fixed total amount of time to devote to the blog, so the total quality of all the posts is constant: Q_total = p*q. The last part of the model is how to use these factors to describe reader flow. Since we attract C*log(p+1) random readers per unit time, and a fraction q of those readers come back, we have a recurrence relation for the expected number of readers: R_{t+1} = q*R_t + C*log(p+1). Letting L = C*log(p+1), we have R_{t+1} = q*R_t + L. Now suppose there is a limiting value satisfying R_l = q*R_l + L. Solving, we get R_l = L/(1-q), which makes sense because the total is a geometric series in q (I suspected it was; I should have just trusted myself). So if we take the long-time limit, we see that to maximize the number of readers over the long term we should make our quality as high as possible at the cost of quantity.

Of course, this was based on the assumption that the quality of a post is directly proportional to the amount of time spent on it, when in reality I suppose quality is more like the logarithm of the time you spend. Sure, you can always make a post better, but perhaps only if you are willing to spend some fraction again of all the time you have spent on it up to that point.
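The recurrence is easy to check numerically. A small sketch (C = 100 is an arbitrary choice of mine) comparing a few (p, q) strategies with the same total effort Q_total = p*q = 0.9:

```python
import math

def steady_readers(q, p, C=100.0, steps=2000):
    """Iterate R_{t+1} = q*R_t + C*log(p+1) until it settles down."""
    L = C * math.log(p + 1)
    R = 0.0
    for _ in range(steps):
        R = q * R + L
    return R

# Three strategies with the same total effort Q_total = p*q = 0.9:
for p, q in [(9.0, 0.1), (1.8, 0.5), (1.0, 0.9)]:
    closed_form = 100.0 * math.log(p + 1) / (1 - q)
    print(p, q, round(steady_readers(q, p), 2), round(closed_form, 2))
```

The iterated value matches the closed form L/(1-q) in each case, and the low-volume, high-quality strategy ends up with by far the most steady-state readers.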

### Tesla coils are cool

I have had a love of the Tesla coil for a very long time. I'm not sure when it began, but it was either in junior high or early in high school.

http://en.wikipedia.org/wiki/Tesla_coil

Basically, in a Tesla coil you take some low-voltage power source and step it up to a few kV, which isn't hard. The really nifty part is where you then run that few kV through a spark gap and into the primary coil of a second transformer. The reason this is all nifty-like is that the spark takes an extremely small amount of time and has a waveform with lots of very high frequency components. Because of this, the output of the secondary coil gets a big kick both in voltage and in frequency. So you can drive a Tesla coil with good old 60 Hz and get 10 kHz out of the secondary coil. I haven't given much thought to the design of Tesla coils for the last couple of years, and it just sort of makes me happy that I understand the dynamics so much better now.
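The claim about the spark's frequency content is easy to illustrate: compare the spectrum of a smooth-ish 60 Hz on/off waveform with that of a train of very brief impulses at the same repetition rate (a toy signal-processing sketch, not a circuit simulation):

```python
import numpy as np

fs = 100_000                 # sample rate (Hz)
t = np.arange(fs) / fs       # one second of samples

# A 60 Hz on/off square wave versus a train of very brief "sparks",
# both repeating about 60 times per second.
slow = (np.sin(2 * np.pi * 60 * t) > 0).astype(float)
spark = np.zeros_like(t)
spark[::int(fs / 60)] = 1.0

def frac_above(signal, cutoff_hz):
    """Fraction of the signal's spectral energy above cutoff_hz."""
    spec = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return float(spec[freqs > cutoff_hz].sum() / spec.sum())

# The brief impulses put most of their energy into high frequencies;
# the slow square wave puts almost none there.
print(frac_above(slow, 10_000), frac_above(spark, 10_000))
```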

## Tuesday, July 14, 2009

### I met a guy named Stan

I was walking home with some edibles when a man walked up to me with the refrain "hey man I am just tryin to scrounge up a buck. I just want a buck for a budweiser I been walkin around so long my feet hurt just." The guy seemed earnest enough, and I gave him $2. I didn't really care what he wanted the money for. All that really mattered was that, very clearly, one dollar would make a significant difference to this person, whereas to me (at least at the moment, though hopefully it will stay this way) $1 doesn't really make much of a difference. After I had given him the money he introduced himself as Stan, and he told me that next time we met he hoped he could pay me back.

At any rate, this started me thinking about utility. Utilitarianism is the idea that one should strive for the greatest good for the greatest number. But perhaps what one should really be trying to do is maximize the total utility of a group of people. If you have 2 people with no money and no place to live, and you give one of them $10,000 and 2 houses to live in, the overall utility goes up. But if instead you give each of them $5,000 and one house, the gain in utility will be greater. This is because for the very poor the utility of a single dollar is very high, whereas the utility of a single dollar to the average American is very low.

### Music state space exploration

Obviously the state space of music is extremely large, and humans will never really thoroughly explore it. At least no one particular human will, since listening to all possible music sounds like one of those endeavors which would take a tad longer than the age of the universe. Of course, in order to make the set of possible musical selections finite, you need to simultaneously discretize the signal and limit its duration.

Even if we confine ourselves to sampling at, say, the CD audio rate of 44.1 kHz (44,100 samples a second) and consider only a single minute of music, that makes us consider a vector space of 2,646,000 dimensions. So even if we allow only, say, 100 different intensities at each time step, that allows for 100^2,646,000 different sound clips.
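The counting above checks out:

```python
samples = 44_100 * 60      # CD-rate samples in one minute of audio
print(samples)             # 2646000 dimensions

# 100 intensity levels per sample gives 100**samples = 10**(2*samples)
# possible clips, i.e. a number with 2*samples + 1 decimal digits.
digits = 2 * samples + 1
print(digits)              # 5292001
```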

Every 1-minute sound clip (with the discretized-intensity restriction kept in mind) is some point in this vector space. Now, although it would be crazy to think that any human being might really thoroughly explore this space (as in listen to most of, or even a tiny fraction of, all possible sound clips), we can still ask how well we have explored it.

Obviously if you look at, say, only Gregorian chant, you are exploring a smaller region of the music state space than if you also include soft rock, heavy metal, classical music, etc.

So I propose a project that I probably will never do but that intrigues me nonetheless: why not use the convex hull of pieces of music as a measure of how much of the music state space has been explored? Say we pick a representative sample of rock music, take the convex hull of these points in our music space, and take the ratio of the volume of the convex hull to the total volume of the space as a measure of the level of exploration.

But say we took the basis vectors of the space as our "music". Then the convex hull of these "music" points would be the simplex for that dimension, and while that might have a relatively low volume compared to the space as a whole, it is still a volume we can't realistically expect real music to outdo, and obviously a basis vector (namely a vector with a single 1 and zeroes elsewhere) is not something that really explores the music state space. So what we probably really want is to do something like take a spectrogram of the music and do our state space analysis on that.

### Posting Times

I find it decidedly odd that this post was first posted about 2 minutes after the previous one, but its time stamp is many hours away from the other one's. I don't really like this time stamping system. It means that if I take months to finish a post, when I finally publish it, it will show up as though I posted it when I began it, not when I finished it.

## Monday, July 13, 2009

### Function Spaces

One rather big surprise for me in the last little bit of my undergraduate education is that both physicists and mathematicians often think of functions as points in a vector space. This can be an incredibly powerful idea: for instance, the Fourier transform is a projection onto a set of orthonormal basis vectors which are the appropriate family of complex exponentials. In the case of quantum physics these changes of basis take on actual physical meaning. For instance, one can formulate wavefunctions as functions of position or of momentum. These two wavefunctions are related to each other by a Fourier transform, in other words by a change of basis. In fact the connection is even deeper than that. The Heisenberg uncertainty principle is a side effect of the fact that compactly supported functions in one basis must have infinite support in the other basis. This is obviously not true of all bases we might choose (wavelet bases, for instance), but the special bases that do have this sort of dual relationship seem to hold a lot of interest for us. In fact, an important part of the apparatus of quantum mechanics is using information about the commutativity of different operators. The fact that the position operator and the momentum operator do not commute implies that their bases are necessarily "inconsistent," in the sense that you can never have a wavefunction which has a finite representation in both.

But when we talk about these function spaces, generally what we are talking about is L2(R), the space of square integrable functions on the reals. In general we could expand our horizons to include, say, L27(R), the space of all functions for which the integral of the 27th power of the function is finite. But no matter which space you choose, no such integrable function space is going to include, say, the function x^2. It might seem odd at first that by far the most worked-with function space, L2(R), doesn't include any of the polynomials, but the reason is that we like dealing with functions which have a finite amount of area under their curves. This fact doesn't tend to be much of a problem, because if you want a function which isn't in L2(R), you can truncate it at some finite limit M and then let M go to infinity.

In fact it seems (at least for nice functions) that the Fourier basis is a basis not only for functions in L2(R) but for most any function on the reals. But there is of course a problem. I glossed over a little issue earlier: the Fourier transform of a function will not always perfectly reconstruct the function. If the function is continuous, then all is well and the function is exactly reconstructed. However, at points of jump discontinuity the Fourier reconstruction fails to reproduce the function and instead takes on the value of the midpoint of the discontinuity. The reason this isn't a problem is that we view functions in L2(R) as "the same" if the "distance" between them is 0, meaning basically that they differ only on a set of measure 0. Now, when you move to functions which do not have a finite norm, things suddenly become a whole lot more complicated: because the function is no longer bounded, even if the function and its reconstruction from a Fourier (or some other) basis differ only on a set of measure 0, there is no guarantee that the "distance" between them in the function space is 0.

Even more disturbing is that when we think about the "distance" between x and x^2, we come up with infinity. From the physical perspective it is actually a good thing that functions like x^2 are not part of the function space used to describe the real world; otherwise we would allow infinite-energy solutions to the wave equation. But I can't help but be deeply uneasy about the fact that there is no good way to incorporate even simple divergent functions into a nice function vector space like L2(R).

### Polynomial Approximation

It recently occurred to me that calculus is really about finding good local polynomial approximations of functions. The derivative is often first introduced as finding the slope of a function "at a point." What this really means, of course, is that we want to find the slope of the line which best locally approximates the function within a sufficiently small region of that point. When we learn about second derivatives we are told that we are simply taking the derivative of the first derivative, which is true enough. But what we are really doing is finding (up to a constant factor) the coefficient of the x^2 term in the best local quadratic approximation of the function. In general, taking derivatives of a function just means finding the best local polynomial approximations and then looking at the appropriate coefficient for the order of derivative you want.
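That constant factor matters: the x^2 coefficient of the best local quadratic is f''(x0)/2, not f''(x0) itself. A quick check with a least-squares quadratic fit (the point x0 = 0.7 and the function sin are arbitrary choices for illustration):

```python
import numpy as np

# Fit the best local quadratic to sin(x) near x0 and compare twice its
# x^2 coefficient against f''(x0) = -sin(x0).
x0, h = 0.7, 1e-3
xs = np.linspace(x0 - h, x0 + h, 9)
a2, a1, a0 = np.polyfit(xs - x0, np.sin(xs), 2)

print(2 * a2)        # ~ -0.6442, i.e. f''(0.7)
print(-np.sin(x0))   # -0.6442...
```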

Of course, in practice it goes exactly the other way: if you want a good polynomial approximation of a function around a point, you use the Taylor series expansion of the function, which you build up from its derivatives.

### Intelligence and Delay

You know, all this thinking about Q-learning has given me a new perspective on the nature of delayed rewards. In psychological circles you will probably hear the term "delayed gratification" instead of "delayed reward," which is used in AI. There was a study wherein children were asked to choose between eating one marshmallow now or getting 2 to eat when the researcher returned in 15 minutes. At any time before the 15 minutes were up, the child could ring a bell and the researcher would come back and give the child one marshmallow, but the child would not get the second one. The results showed that those children who held out for the full 15 minutes at age 4 had SAT scores an average of 210 points higher than those who held out for only 30 seconds.

I remember hearing about a study very similar to this, although I am pretty sure the study I heard of involved IQ tests. When I first heard this sort of result, I thought it was interesting though not terribly surprising; delayed gratification seemed like a perfectly logical effect of higher intelligence. The way I thought about the correlation between the ability to delay gratification and intelligence was simply that smarter people better realized the benefit of the delay. I now think this is very probably almost entirely wrong. The arrow of causation very probably goes completely the other way. After all, the desirability of 2 marshmallows over 1 was apparently very clear to these children. It isn't that only the smart ones figured out that 2 marshmallows are better than 1, or even (as I thought) that the smarter ones could better judge how much better 2 marshmallows are. If our brains work anything at all like reinforcement learning machines (and I think they do), then the point is that people with larger discount factors are smarter! A discount factor is a means by which you can control the degree to which a reinforcement learning agent prefers smaller sooner rewards to larger later rewards. In theory we don't really want a discount factor at all, but in practice it is actually helpful, and to some extent necessary, to have one. In essence, the problem is that without a discount factor your reward estimates don't necessarily converge, whereas with one they do. In fact, the smaller your discount factor (meaning the more you prefer short-term to long-term rewards), the faster the convergence of the value estimates.
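The convergence claim shows up even on the simplest possible problem: a single state with a self-loop paying reward r each step, where value iteration is just V <- r + gamma*V (a toy sketch, not a full Q-learning setup):

```python
def iterations_to_converge(gamma, r=1.0, tol=1e-6):
    """Value iteration on one self-looping state: V <- r + gamma*V.
    The fixed point is r/(1-gamma); smaller gamma converges faster."""
    V = 0.0
    target = r / (1 - gamma)
    n = 0
    while abs(V - target) > tol:
        V = r + gamma * V
        n += 1
    return n

# A myopic agent (gamma = 0.5) settles on its value estimate far sooner
# than a far-sighted one (gamma = 0.95).
print(iterations_to_converge(0.5), iterations_to_converge(0.95))
```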

So maybe it isn't that delayed gratification is a side effect of higher intelligence. Maybe intelligence is a side effect of delayed gratification! In fact I think that a useful (if not necessarily the best) definition of intelligence might be "the ability to delay reward".

## Monday, July 6, 2009

### Learning stuff is rewarding

I have spent a great deal of time over the past few months thinking about Q-learning and other types of reinforcement learning. In a reinforcement learning problem you are given a reward function as part of the problem, and then you go about trying to maximize your reward. When it comes down to it, though, the reward function is something a general AI would have to come up with on its own. As humans we are capable of figuring out what is good and what is bad, and by how much. The reward function of each individual is rather unique: one person might find reading Superman comics highly rewarding while another finds it a complete waste of time. I don't really have much to say about how one should go about creating a good general reward function for human-level intelligence. That doesn't mean, however, that I don't have some interesting ideas that could be turned into interesting experiments.

The best idea I have is to give rewards for learning the correct Q-values, added on top of the usual rewards. Let's say for the moment we are working in a finite MDP for which we already have the true value of each state available. Then we could use those true values to give a reward or punishment to the Q-learner based on how close its Q-values are to the real values. This would be an awfully interesting way of encouraging exploration. In more general scenarios, of course, we would not have the true values available (after all, if we did, why would we care about doing Q-learning?). So instead we would have to use some heuristic giving a basic idea of how good our Q-values are at the moment. We could, for instance, look at the rate of change of our Q-values, with a low rate of change taken to indicate good Q-values.
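As a very rough sketch of this idea (the heuristic, the shaping constant `c`, and the toy demo are all my own guesses, not a worked-out method), one could fold the recent rate of change of a Q-value into the reward:

```python
from collections import defaultdict

def learning_reward_update(Q, last_change, s, a, r, s_next, actions,
                           alpha=0.1, gamma=0.9, c=0.5):
    """One Q-learning step with an extra reward for having well-settled
    Q-values, using the negated recent rate of change of Q(s, a) as a
    cheap stand-in for closeness to the true values."""
    bonus = -c * last_change[(s, a)]          # settled estimates earn more
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = (r + bonus) + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    last_change[(s, a)] = abs(alpha * delta)  # track the rate of change
    return Q[(s, a)]

# Tiny demo: a one-state, one-action world with constant reward 1.
Q, change = defaultdict(float), defaultdict(float)
for _ in range(2000):
    learning_reward_update(Q, change, 's', 'a', 1.0, 's', ['a'])
print(round(Q[('s', 'a')], 2))       # settles near 1 / (1 - gamma) = 10
print(round(change[('s', 'a')], 4))  # rate of change shrinks toward 0
```

In this trivial world the shaping term vanishes as the estimate settles, so the usual fixed point survives; whether that stays true in interesting MDPs is exactly the open question.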

### Conservation Laws

Sometimes it seems really amazing to me that energy is conserved exactly. It is possible that energy is conserved only approximately, though to an incredible degree, but I think I'm going to go with Occam on this one and say that energy must be conserved exactly. And energy is not the only thing: rotational and linear momentum, electric charge, baryon number, and more all remain completely unchanged in time. It wouldn't really surprise me to learn somewhere down the line that baryon number is not really conserved, and to a lesser extent I could believe that electric charge might not be either. There are a bunch of other conservation laws that I don't really understand, so I should just keep my mouth shut about those particulars.

If you think long and hard about the conservation of linear momentum you eventually come to the conclusion that it is a consequence of the homogeneous nature of space (thank you, Emmy Noether), and that the conservation of angular momentum is due to the isotropy of space. I often like to think that perhaps our notion of completely closed dimensions within the multiverse is not true at all. In a multiverse it seems to me that the closest dimensions should overlap with one another. There would be no discernible difference between such dimensional neighbors, and so there would be no reason for the overlap to cause problems. However, I sometimes worry about this world view, because it seems to me that such an overlap between dimensions would allow a tiny amount of leeway with respect to the conservation laws. The laws would still hold absolutely, but the conservation would have to be smeared over all universes in order to survive. And it seems to me that such non-conservation events would very probably already have been observed. I don't have a solution; it is just a troubling challenge for the view I hold of the multiverse.

## Sunday, July 5, 2009

### Aliens and Pictures

It is extremely interesting that our eyes can be so effective at sorting out spectral information even while they are so easily fooled. I don't know exactly what our eyes do to extract spectral information, but if you stop and think about it for a second, it is really pretty amazing that we can fool our eyes into seeing any color we like simply by combining three wavelengths of light at different intensities. It is a rather profoundly odd thought that all of our nice colorful pictures might turn out to be completely unrecognizable to some alien species whose eyes use a slightly different heuristic to extract spectral data from whatever they are looking at. Line art and the like would no doubt still be recognizable, and print would probably also be seen correctly (if not understood), but a picture of a flower would likely be perceived by some other entity as a strange, color-distorted, possibly totally unrecognizable blob of primary colors.

## Saturday, July 4, 2009

### Happy fourth

I'm just kicking back and relaxing today. In fact this is all I'm going to write for my post today. Happy fourth everyone!

## Friday, July 3, 2009

### approximate Q learning

Usually, if a Q-learning problem is interesting, it will have too many states to make it practical to keep track of the value of each state separately. So in approximate Q-learning you replace states with a set of features of the state, and then you try to learn a value function on those features. In chess, for instance, instead of keeping track of the value of each board configuration (since there are about 35^50 of them), you keep track of only a few features of the configuration. You might, for instance, track the total number of squares your pieces are attacking and the total number your opponent's pieces are attacking; in general, attacking more squares than your opponent denotes a board position that is good for you. In normal linear approximate Q-learning you assume that the value function associated with a feature is a linear function through the origin. So if on one Q-learning update step your reward and end-state feature evaluation give you a value of 10 for a feature value of 5, you would expect the value function for that feature to be a line with slope 2. Every time you make an observation you nudge the slope of the function a bit toward the slope you observe.
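A minimal sketch of that update (the feature names and all the numbers are invented for illustration, not from any chess engine):

```python
# Linear approximate Q-learning: Q(s, a) is a weighted sum of features,
# and each observation nudges the weights toward the observed value.

def linear_q_value(weights, features):
    """Q(s, a) as a weighted sum of features of the state-action pair."""
    return sum(weights[f] * v for f, v in features.items())

def linear_q_update(weights, features, reward, next_value,
                    alpha=0.01, gamma=0.9):
    """Nudge each weight in proportion to the prediction error and to
    how much its feature contributed to the prediction."""
    error = (reward + gamma * next_value) - linear_q_value(weights, features)
    for f, v in features.items():
        weights[f] += alpha * error * v
    return weights

# The chess features from the text, with made-up observations: a position
# attacking 5 more squares than the opponent is repeatedly worth 10.
weights = {'my_attacked_squares': 0.0, 'their_attacked_squares': 0.0}
for _ in range(200):
    linear_q_update(weights,
                    {'my_attacked_squares': 5, 'their_attacked_squares': 0},
                    reward=10, next_value=0)
print(round(weights['my_attacked_squares'], 2))  # slope approaches 10/5 = 2
```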

A linear function through the origin does not seem to me a particularly hopeful candidate for accurately describing a utility function. In fact, if I were to choose a family of one-parameter functions to describe utility functions, I would probably go for exponential decay (and since sparsity is a good thing, maybe that is in fact exactly what I should do). But for my wavelets end-of-term project last semester I decided it might be more interesting to allow totally general utility functions and try to make good guesses by assuming they are sparse in a wavelet basis. With that assumption you could even use compressive sensing to make efficient use of your observations, and even if you are wrong about the sparsity of the utility function in the wavelet basis, that's OK, because you still have the ability to learn an arbitrarily good approximation of the function.

I actually implemented said learning algorithm and used it on the pacman problem, but it kinda sucked. My wavelets professor thought the work was cool enough that I might want to pursue publication of some sort, and I have been talking to Hal, my AI professor, who also seems to think it is potentially publishable. So possibly the aftermath of the pacman competition could also give me my first publication!

## Wednesday, July 1, 2009

### Q-Learning

A Markov Decision Process (MDP) is defined by a set of states together with a set of allowed actions for each state, a transition model giving the probabilities of moving between states when an action is taken, and finally a reward model giving a numerical reward for any (beginning state, action, end state) triple. A great many things can be described as MDPs; Pacman, for instance, is an MDP. In fact just about every game is an MDP. Games with hidden information are POMDPs, that is, Partially Observable Markov Decision Processes.

If you want to win a game, one way of doing it is to treat the game as an MDP and then try to solve for the optimal value function over the states. The optimal value of a state is the total cumulative reward you can expect by acting optimally from that state onward. Since a lot of games don't have an absolute stopping point, you generally use a discounted reward model instead, which basically means you prefer short-term rewards to long-term rewards to some extent. To figure out what to do in an MDP you can try to learn the transition model and the reward model, and from there solve the MDP to get the optimal value function.

Because trying to learn the transition model as well as the rewards is kind of a pain, a popular alternative is to learn values over all pairs of states and actions instead; this is called Q-learning. In Q-learning you keep track of what you think the value of each state-action pair is. You pick an action from your current state, then update the Q-value of the pair you just used (the state you came from and the action you took) based both on your current belief about the value of the state you transitioned into (which you can find by taking the maximum of the Q-values of the new state) and on any rewards you got along the way. Voila! You have a working means of picking actions in any state: look at the Q-values and pick the action with the highest one.
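The update just described looks like this in code (a generic sketch with made-up state and action names, not any particular library's API):

```python
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, next_actions,
                    alpha=0.5, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])
    return Q[(state, action)]

def greedy_action(Q, state, actions):
    """Pick the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = defaultdict(float)
# One observed transition: going 'right' from 'start' earned reward 1.
q_learning_step(Q, 'start', 'right', reward=1.0, next_state='goal',
                next_actions=['stay'])
print(greedy_action(Q, 'start', ['left', 'right']))  # 'right'
```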

Sorry if this explanation is horrible (it is); you should probably look it up on Wikipedia, or if you really care, google "sutton and barto reinforcement learning" and you can read a book about it online.

Now my beef with Q-learning is this: in game theory you spend a bunch of time discovering that pure strategies optimally solve only a very small fraction of games. I am talking here about two-player zero-sum games; actually my beef is not with Q-learning in general but specifically with its use in two-player zero-sum games. Consider rock-paper-scissors (RPS). There are three pure strategies: rock, paper, and scissors. A Q-learner set loose on this game would at first try something at random, say paper, and win because its opponent chose rock. Now the Q-learner thinks paper is the best strategy, because the first time it tried it it won, so next time it tries paper again and loses because its opponent chose scissors. The point is that someone who knows their opponent is a Q-learner knows the Q-learner will tend to avoid whatever it just lost with and pick again whatever it just won with. So Q-learners fail at RPS. From a game-theoretic standpoint, what a deterministic Q-learner is doing is assuming that every game has a saddle point (which is to say an optimal pure strategy). A popular alternative to the deterministic Q-learner is one which, rather than always taking the "best" Q-value, chooses among the actions using a Boltzmann (or softmax, as it is actually called) distribution with some temperature. The lower the temperature the closer the Q-learner is to deterministic, but at any finite temperature a Boltzmann Q-learner will eventually learn to play RPS correctly. The current wisdom, however, is to use the temperature simply as a means to allow convergence and then take it to 0 in the limit (making the thing deterministic). From my perspective the temperature is clearly a feature.
In fact I wonder whether picking a finite "equilibrium" temperature for each state might actually yield the correct game-theoretic optimal mixed strategy. I don't really have any reason to believe that the distribution over actions would converge in the appropriate way for the Boltzmann distribution to give the correct mixed strategy, but the idea is just so pretty that even if it isn't true it couldn't hurt to try it and see how it goes.
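For concreteness, here is one way to code Boltzmann (softmax) action selection; a generic sketch, not any particular library's version:

```python
import math
import random

def softmax_choice(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability proportional to
    exp(Q / T); subtracting the max first avoids overflow."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r, cumulative = rng.random() * total, 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1

# At a very low temperature the choice is effectively deterministic:
print(softmax_choice([0.0, 1.0, 2.0], temperature=0.01))  # index 2
# At higher temperatures the lower-valued actions get sampled too,
# which is what lets this agent avoid the pure-strategy trap in RPS.
```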

Now that I have been thinking about it a bit more, the pretty idea is sunk, or at the very least incomplete. The whole point of optimal mixed strategies is that the expected payoff of each pure strategy in the mix is the same as all the others (assuming the opponent plays their own optimal mixed strategy). So if the Q-learner were in fact to learn the optimal values for each action, it would learn that they were all the same. And yet it shouldn't necessarily choose between them with equal probability, or it would not be acting optimally. Even trying something fancy like that, the normal Q-value system is just sunk as far as true game-theoretic optimality goes.

I guess what you would have to do is my first thought for correcting this problem: solve a small matrix game at every state. The rows of the matrix would be the actions you could take and the columns the actions your opponent could take. You could fill in the entries by getting the values of the states that result from those action pairs, then solve the matrix for the optimal mixed strategy. Following that strategy, you can then do a one-step update from your destination state.
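Short of writing an LP solver, the per-state matrix game can be solved approximately. Here is a sketch using fictitious play, a simpler iterative method than the LP solve described above, on the RPS payoff matrix (the round count and starting counts are arbitrary choices of mine):

```python
def best_response(payoff, opponent_counts):
    """Row that scores best against the opponent's empirical mixture."""
    n = len(payoff)
    return max(range(n),
               key=lambda i: sum(payoff[i][j] * opponent_counts[j]
                                 for j in range(n)))

def fictitious_play(payoff, rounds=30000):
    """Approximate the optimal mixed strategy of a zero-sum matrix game:
    both players repeatedly best-respond to the other's history of play,
    and the empirical frequencies approach the equilibrium."""
    n = len(payoff)
    # The column player's own payoff matrix in the zero-sum game.
    neg = [[-payoff[j][i] for j in range(n)] for i in range(n)]
    row_counts, col_counts = [1] * n, [1] * n
    for _ in range(rounds):
        row_counts[best_response(payoff, col_counts)] += 1
        col_counts[best_response(neg, row_counts)] += 1
    total = sum(row_counts)
    return [c / total for c in row_counts]

# Rock-paper-scissors payoffs for the row player.
rps = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
print([round(p, 2) for p in fictitious_play(rps)])  # near 1/3 each
```

The empirical frequencies recover the uniform mixed strategy that the deterministic Q-learner can never find.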

### Artificial Intelligence and Brain Damage

Let's suppose for a moment that over the next decade or so we achieve a series of breakthroughs in nanotechnology and AI that allow the creation of essentially perfect duplications of someone's personality as an AI construct. I would be willing to say that if a person received massive brain damage in that era, it would make sense to equip them with something of an artificial brain. Remember that the reproduction of their personality would be exact, so all of their characteristics would remain intact when this change took place. I would be an advocate of this sort of procedure, and actually of essentially the same procedure even if it was death rather than brain damage that had occurred (ignoring for the moment the moral implications of the resulting overpopulation and possible stagnation problems).

But what about people (like my mother) who experienced debilitating brain damage decades ago and have since evolved into someone very different from their original self, though still sharing similarities? What if you could go in and "correct" their brain, restoring their old capacities and old personality while letting them retain the accumulated memories of the intervening time? That would be unfair to the person living now, but leaving things as they are is perhaps unfair to the person as they were. The problem is of course not really much of a problem, because the moral dilemma only has one solution: the person who exists now can choose what they want. I guess I'm just babbling. Isn't the internet wonderful? I can post crap like this and nobody cares! wait... nobody cares... /sob

### Compressive Sensing

The idea behind compressive sensing is that any interesting signal is not going to look like noise. If you want to reconstruct a signal you probably don't need to sample all of it, just enough to capture the information it contains. Say, for instance, you know the signal is perfectly linear: you can sample two points and be done. In essence, the idea is that real signals tend to be approximately sparse in most bases. Being approximately sparse in a wavelet basis means that most of the coefficients of the wavelet transform are either zero or close to zero. The compressive sensing community says a vector is k-sparse if it has k nonzero entries, and approximately k-sparse if all but k entries are close to zero. Now, the oh-so-nifty thing about compressive sensing is that you only need something like k*log(N/k) measurements to recover the signal, where N is the number of coefficients. What you do is take your measurements at random, use them as constraints on a linear program whose variables are the coefficients of the signal you want to recreate, and then minimize the sum of the absolute values of those coefficients (the L1 norm).
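As a quick illustration of "approximately sparse in a wavelet basis" (a toy of my own, not a compressive sensing recovery), the Haar transform of a smooth ramp puts most of its information into a handful of coefficients:

```python
def haar_transform(signal):
    """Full 1-D Haar wavelet transform (length must be a power of 2):
    repeatedly split into pairwise averages and pairwise details."""
    out = list(signal)
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out

# A smooth ramp: neighboring samples barely differ, so the fine-scale
# detail coefficients are all tiny.
signal = [i / 16 for i in range(16)]
coeffs = haar_transform(signal)
small = sum(1 for c in coeffs if abs(c) < 0.1)
print(small, 'of', len(coeffs), 'coefficients are near zero')
```

Most of the 16 coefficients are negligible, which is exactly the structure L1 minimization exploits.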

Coding up something like that, however, would be a bit complicated. At some point it would be great to program something capable of running the simplex algorithm for linear programming, but for the moment that is a project I am not willing to undertake. I wanted to use compressive sensing for my wavelets project at the end of the semester, but I ended up not doing it because I couldn't code it (at least in time) and I didn't understand it all that well. What I ended up using was an algorithm I just made up. It was relatively easy to code (compared to a general LP solver), which is essentially why I used it. The algorithm takes a set of direct measurements of a signal and makes a bad guess for what the signal ought to look like. It then takes the wavelet transform of this bad guess. It also takes the wavelet transform of a vector with 1's where the measurements were made; this keeps track of which wavelet coefficients contain direct information about the signal and which do not. Starting at the lowest level of the wavelet transform, we look at the transform of the information vector, and if it has a nonzero coefficient in a certain place, we do a reverse wavelet transform ignoring the detail coefficients and put the resulting value in that place in the signal. We continue this at every level, all the way up, until the last step is simply putting the observed measurements back in place.

The algorithm as I just described it is pretty sucky: slow and inefficient, but it does give pretty good estimates of the signal. An interesting thing to note, however, is that iterating the algorithm with the same observations, feeding its output back in as the "bad guess" for the next iteration, yields better results. I haven't really tested how much better, but the iterated version is totally a compressive sensing algorithm of sorts. Intuitively it makes sense that, since we throw away detail coefficients on every iteration, we get closer and closer to a vector that is sparse in the wavelet basis, though I haven't done a proper analysis, so I don't know if that is really true. I certainly am not willing to claim that the iteration converges to the L1 minimization in general, though it does seem to in at least some simple cases.
