I have spent a great deal of time over the past few months thinking about Q-learning and other types of reinforcement learning. In a reinforcement learning problem you have a reward function, given to you as part of the problem, and you go about trying to maximize your reward. Ultimately, though, the reward function is something a general AI would have to come up with on its own. As humans we are capable of figuring out what is good and what is bad, and by how much. The reward function of each individual is rather unique: one person might find reading Superman comics highly rewarding while another finds them a complete waste of time. I don't have much to say about how one should go about creating a good general reward function for human-level intelligence, but that doesn't mean I don't have some interesting ideas that could be turned into interesting experiments.
The best idea I have is to give rewards for learning the correct Q-values, added on top of the usual rewards. Let's say for the moment that we are working in a finite MDP for which we already have the true value of each state available. Then we could use those true values to give a reward or punishment to the Q-learner based on how close its Q-values are to the real ones. This would be an awfully interesting way of encouraging exploration. In more general scenarios, of course, we would not have the true values available (after all, if we did, why would we bother with Q-learning?). So instead we would have to use some sort of heuristic that gives a rough idea of how good our current Q-values are. We could, for instance, look at the rate of change of our Q-values, with a low rate of change taken as a sign of good Q-values.
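To make the first version of the idea concrete, here is a minimal sketch in Python of the tabular case. The toy 5-state chain MDP, the bonus coefficient beta, and all the hyperparameters are my own assumptions purely for illustration. We first compute the true Q-values by value iteration, then run ordinary Q-learning where each step's reward gets an extra penalty proportional to how far the current estimate is from the true value:

```python
import numpy as np

# Hypothetical 5-state chain MDP: action 0 moves left, action 1 moves right.
# Reaching the rightmost state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def true_q(iters=200):
    # Value iteration gives the "true" Q-values we assume are available.
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(iters):
        for s in range(N_STATES - 1):      # last state is terminal
            for a in range(N_ACTIONS):
                s2, r, done = step(s, a)
                Q[s, a] = r + (0.0 if done else GAMMA * Q[s2].max())
    return Q

def q_learning(beta=0.1, alpha=0.5, eps=0.2, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Qstar = true_q()
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = step(s, a)
            # Extra shaping term: punish the learner in proportion to how
            # far its current estimate is from the true Q-value.
            r -= beta * abs(Q[s, a] - Qstar[s, a])
            target = r if done else r + GAMMA * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q, Qstar

Q, Qstar = q_learning()
```

In the general case where the true values are unavailable, the `abs(Q[s, a] - Qstar[s, a])` term could be swapped for a heuristic such as the magnitude of recent TD updates, so that a low rate of change earns the bonus instead of closeness to known values.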