All this thinking about Q-learning has given me a new perspective on the nature of delayed rewards. In psychology circles you will probably hear the term "delayed gratification" rather than "delayed reward", which is the term used in AI. There was a study in which children were asked to choose between eating one marshmallow now or getting 2 to eat when the researcher returned in 15 minutes. At any time before the 15 minutes were up, the child could ring a bell; the researcher would then come back and give the child one marshmallow, but the child would not get the second one. The results showed that the children who held out for the full 15 minutes at age 4 had SAT scores an average of 210 points higher than those 4 year olds who held out for only 30 seconds.
I remember hearing about a very similar study, although I am pretty sure the one I heard of involved IQ tests. When I first heard this sort of result I thought it was interesting, though not terribly surprising. Delayed gratification seemed like a perfectly logical effect of higher intelligence: I assumed that smarter people simply better realized the benefit of the delay. I now think this is very probably almost entirely wrong, and that the arrow of causation very probably goes the other way. After all, the desirability of 2 marshmallows over 1 was apparently very clear to these children. It isn't that only the smart ones figured out that 2 marshmallows are better than 1, or even (as I had thought) that the smarter ones could better judge how much better the 2 marshmallows are.
If our brains work anything at all like reinforcement learning machines (and I think they do), then the point is that people who discount future rewards less are smarter! A discount factor is a number between 0 and 1 that controls how strongly a reinforcement learning agent prefers sooner rewards to larger, later ones. Each step of delay multiplies a reward's value by the discount factor, so a factor close to 1 makes the agent patient and a factor close to 0 makes it short-sighted. In theory we don't really want to discount at all (a factor of exactly 1), but in practice discounting is helpful and to some extent necessary. In essence, the problem is that without discounting the value estimates don't necessarily converge, whereas they do converge once the discount factor is below 1. In fact, the smaller the discount factor (meaning the more you prefer short-term to long-term rewards), the faster the value estimates converge.
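To make the discount factor concrete, here is a minimal sketch in Python. It is my own toy illustration, not anything from the study: the function names are made up, I assume one time step per minute (so the second marshmallow is 15 steps away), and the convergence check uses a single state that pays a constant reward forever rather than a real Q-learning problem.

```python
def discounted_value(reward, delay, gamma):
    """Value today of a reward that arrives `delay` steps from now."""
    return reward * gamma ** delay


def prefers_waiting(gamma, now=1.0, later=2.0, delay=15):
    """True if 2 marshmallows after `delay` steps beat 1 marshmallow now.

    Assumption: one step per minute, so delay=15 models the 15 minute wait.
    """
    return discounted_value(later, delay, gamma) > now


def sweeps_to_converge(gamma, reward=1.0, tol=1e-6, max_sweeps=100_000):
    """Count updates V <- reward + gamma * V until V changes by less than tol.

    Toy setting (an assumption): one state that loops back to itself and
    pays `reward` on every step. Returns None if the estimate has not
    settled within max_sweeps, which is what happens when gamma is 1.
    """
    value = 0.0
    for sweep in range(1, max_sweeps + 1):
        new_value = reward + gamma * value
        if abs(new_value - value) < tol:
            return sweep
        value = new_value
    return None


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99, 1.0):
        print(
            f"gamma={gamma}: waits={prefers_waiting(gamma)}, "
            f"sweeps={sweeps_to_converge(gamma)}"
        )
```

Under these assumptions the agent only holds out for the second marshmallow when gamma is above roughly 0.955 (the point where 2 * gamma^15 equals 1), and the more patient settings take more sweeps to settle, with gamma = 1 never settling at all.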
So maybe it isn't that delayed gratification is a side effect of higher intelligence. Maybe intelligence is a side effect of delayed gratification! In fact, I think that a useful (if not necessarily the best) definition of intelligence might be "the ability to delay reward".