Wednesday, January 13, 2010

My Web Identity Equity

I just put "Tim Anderton" (my name) into Google and hit search. A quick perusal of the results is interesting but not terribly heartening, although results that actually have to do with me do at least show up on the first page: results 2, 7, and 8 were related to me. Not surprisingly, the most prominent of these was the course web page for the class I was a TA for last semester. The others were in a similar vein: 7 was a page listing past REU participants and 8 was my Facebook page.

Sifting through the first hundred results drops the percentage of relevant pages from 30% to 17%, which is still surprisingly high. But simply taking the percentage of results related to me is rather naive, since it doesn't take the order of the results into account. Clearly, if the 17 relevant results were all at the end of the 100, that would be a much weaker net presence than if they were the first 17. How exactly I should deal with the strength of the order dependence is unclear.

After some thought I have settled on a system for measuring this quality of prominence. For lack of a better name I shall call it Web Identity Equity. Of course this is not restricted to persons: if you entered the term "Thermal Physics" and looked through the results to see where the textbook of that name by Kittel and Kroemer falls, you could measure the identity equity of that book on the web, thus the name. Instead of representing the Web Identity Equity (WIE) as a single number, I have decided it is better to treat it as a vector.

The first element of the vector should definitely be the percentage. Even though it doesn't take any sort of ordering into account, it is easy to interpret and gives good information about the density of results.
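As a rough sketch of this first element, the snippet below computes the plain fraction of relevant results. The function name `density` and the two made-up orderings are only illustrative assumptions; the point is that 17 hits at the top and 17 hits at the bottom of 100 results score exactly the same.

```python
def density(relevant_ranks, total_results):
    """Fraction of the inspected results that are relevant (order is ignored)."""
    return len(set(relevant_ranks)) / total_results


# Ranks 2, 7 and 8 out of the first 10 results, as in the search described above.
first_page = density([2, 7, 8], 10)

# Hypothetical orderings: 17 hits at the very top versus 17 at the very bottom.
top_heavy = density(range(1, 18), 100)
bottom_heavy = density(range(84, 101), 100)

print(first_page, top_heavy, bottom_heavy)
```

Both hypothetical orderings give the same value, which is exactly the blindness to order that the remaining elements try to correct.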

After a little deliberation I decided the second element of the vector should be a weighted sum with weights of 1 over the search rank. So the first search result gets weight 1, the second gets weight 1/2, the third 1/3, and so on. This has the advantage that early results are greatly favored over later ones, but there is not much difference between the importance of the 14th and 15th results. To make the number nicer to process and compare, the sum should be normalized. The sum of 1/n from 1 to N is approximately Ln(N) + gamma for large N, where gamma is the Euler-Mascheroni constant (about 0.5772). So, not caring to carry out the normalization exactly, I will just divide the sum of the reciprocals of the relevant rankings by Ln(N) + gamma and call it close enough. The error is small for any appreciable number of results: about 2% for 10 results and already about 0.1% for 100, so it wouldn't be noticeable anyway.
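A minimal sketch of the 1/rank score follows, together with a check of how far Ln(N) + gamma is from the exact harmonic sum. The names `harmonic_score` and `approximation_error` and the `exact` flag are assumptions made for illustration, not part of the original method.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_score(relevant_ranks, total_results, exact=False):
    """Sum of 1/rank over the relevant results, normalized to roughly [0, 1]."""
    raw = sum(1.0 / r for r in relevant_ranks)
    if exact:
        # Exact normalization: the N-th harmonic number.
        norm = sum(1.0 / n for n in range(1, total_results + 1))
    else:
        # Approximate normalization used in the text.
        norm = math.log(total_results) + EULER_GAMMA
    return raw / norm


def approximation_error(total_results):
    """Relative error of Ln(N) + gamma compared with the exact harmonic sum."""
    exact = sum(1.0 / n for n in range(1, total_results + 1))
    approx = math.log(total_results) + EULER_GAMMA
    return (exact - approx) / exact


print(approximation_error(10), approximation_error(100))
print(harmonic_score([2, 7, 8], 10))
```

Because the approximation slightly underestimates the exact sum, a result list in which every entry is relevant would score a little above 1 unless `exact=True` is used.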

The 1/r weighting, however, falls off extremely slowly in the tail, as the divergence of the series attests. So the natural next step is to use the weighting 1/r^2 (where r is the rank of the result). This has the nice property that the sum of the weights over an infinite number of results converges to Pi^2/6, which gives a natural normalizing constant. Since the tail of the distribution now gets, in some sense, zero weight, this third number represents a heavy favoring of early search results.
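The 1/r^2 score can be sketched in the same way. This version assumes the normalization is the infinite-sum limit Pi^2/6 regardless of how many results were inspected; the function name is hypothetical.

```python
import math


def inverse_square_score(relevant_ranks):
    """Sum of 1/rank^2 over the relevant results, divided by Pi^2/6."""
    raw = sum(1.0 / r**2 for r in relevant_ranks)
    return raw / (math.pi**2 / 6)


# Ranks 2, 7 and 8 from the first page of the search described above.
print(inverse_square_score([2, 7, 8]))

# Everything from rank 11 to rank 100 combined carries very little weight.
print(inverse_square_score(range(11, 101)))
```

Ranks 2, 7, and 8 alone give roughly 0.174, so under this weighting nearly all of the score comes from the first page.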

Finally, for the fourth number I feel I need to take into account the fact that people often will not go to the next page of search results if they can help it, and that results on the same page tend to get similar attention. So for the last number we take a weighted average of the density of relevant results on each page. Since moving from page to page is much less likely than just looking at the next result, I feel the weights should have a strong bias toward the early result pages. But unlike for the earlier measures, I feel that here it is appropriate to treat the number of pages visited as Poisson distributed. Somewhat obviously, the mean number of pages visited should be greater than 1, since you can't visit fewer than 1 search page and the probability of visiting more than 1 is nonzero, but it is definitely less than 2. If we just let the Poisson mean be 1, the appropriate weighting is rather nicely just 1/r! (that is, 1 over r factorial), where r here is the page number.
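A sketch of the page-weighted density is given below. It assumes 10 results per page and normalizes by the sum of the 1/r! weights over the pages actually inspected (which approaches e - 1 as the number of pages grows); both choices, and the name `page_score`, are assumptions rather than details stated above.

```python
import math


def page_score(relevant_ranks, total_results, page_size=10):
    """Weighted average of per-page hit density, with weight 1/p! for page p."""
    n_pages = math.ceil(total_results / page_size)

    # Count the relevant results that fall on each page.
    hits = [0] * n_pages
    for r in relevant_ranks:
        hits[(r - 1) // page_size] += 1

    # The last page may hold fewer than page_size results.
    sizes = [min(page_size, total_results - i * page_size) for i in range(n_pages)]
    densities = [h / s for h, s in zip(hits, sizes)]

    # Poisson(mean 1) shaped weights: 1/1!, 1/2!, 1/3!, ...
    weights = [1.0 / math.factorial(p) for p in range(1, n_pages + 1)]
    return sum(w * d for w, d in zip(weights, densities)) / sum(weights)


# Only the first-page hits (ranks 2, 7, 8) out of 100 inspected results.
print(page_score([2, 7, 8], 100))
```

With these weights the first page accounts for well over half of the total, and the second page for most of the rest.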

It is tempting to include an extremely rank-dependent measure that models the search depth as a Poisson-distributed parameter with some larger mean, but frankly it is too much of a pain to calculate and normalize for large numbers of results unless I were willing to write a script. I might do that in the future, but for the moment this will have to do.
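For what it's worth, such a script need not be long. The sketch below is only one possible reading of the idea, not a method defined above: it assumes the searcher reads down to a Poisson-distributed depth, and weights rank r by the probability that the depth is at least r. The function names and the example mean of 10 are assumptions.

```python
import math


def poisson_pmf(k, mean):
    """P(depth == k) for a Poisson distribution, computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def reach_weights(total_results, mean_depth):
    """Weight for rank r is P(depth >= r), an assumed model of reading depth."""
    weights = []
    below = 0.0  # running value of P(depth < r)
    for r in range(1, total_results + 1):
        below += poisson_pmf(r - 1, mean_depth)
        weights.append(max(0.0, 1.0 - below))  # clamp tiny negative rounding
    return weights


def depth_score(relevant_ranks, total_results, mean_depth):
    """Reach-weighted sum over the relevant ranks, normalized by the total weight."""
    weights = reach_weights(total_results, mean_depth)
    return sum(weights[r - 1] for r in relevant_ranks) / sum(weights)


# Hypothetical comparison: 17 hits at the top versus 17 at the bottom of 100.
print(depth_score(range(1, 18), 100, 10.0))
print(depth_score(range(84, 101), 100, 10.0))
```

Under this assumption the normalization is easy: when the number of results is much larger than the mean depth, the weights sum to approximately the mean itself.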

Now, getting back to the actual search results for my name. My Web Identity Equity vector using the first 100 Google results is

0.170, 0.212, 0.181, 0.252

I can't help but be somewhat amazed by the consistency in the different numbers. Apparently my web identity equity is about 20% no matter how you measure it.

Of course, while fun, this is only really interesting when you have other things to compare it with. This sort of thing would be perfect to write a little web script for, so that you could do it for any random thing you felt like, but alas I have no such skill.