Latent Dirichlet Allocation Surprisingly Well Correlated w/ Google Rankings (seomoz.org)
58 points by randfish on Sept 7, 2010 | 25 comments


This is a good layman's introduction to modern search techniques, but to someone not in the SEO field it feels like a very strange inversion of priorities. To me, like most people, the surprise is how effective techniques like LDA[1] can be in characterizing a document, but the 'surprise' in the article is that LDA correlates to Google search order better than a more simplistic model.

To a technologically savvy but naive outsider, this might seem obvious: shouldn't pages that rank highly in Google have strong topic-based correlation to pages that the user wants to see? But from the SEO perspective, I guess the conclusion would be that your page is more likely to be ranked highly if it includes all the trappings of other high ranked pages, with, you know, like synonyms and stuff. At a certain point, one has to start thinking, wouldn't it be simpler to make a page that people actually want to find?

Are there good examples of actually useful pages that Google doesn't do a good job of ranking? Lately I occasionally find myself getting frustrated with Google ignoring my rarer search terms, but generally I find the good pages are at the top if they exist at all.

[1] LDA is Latent Dirichlet Allocation, which is very similar to Latent Semantic Analysis, which in turn is very similar to Principal Component Analysis and Singular Value Decomposition. So it's possible you've already heard of the concept, but coming from another angle in another field.


> Are there good examples of actually useful pages that Google doesn't do a good job of ranking?

Yes. Reviews by actual people.

A long, long time ago (last year I think) I used to use google blogsearch to get at the "other side" of hardware reviews: what people who are not being paid to review hardware think of a specific piece of hardware.

Doesn't work any more. There's almost nothing but spam. It's hopeless.


I agree. There are definitely cases where spam has displaced good results. Although poorly phrased, I guess I was asking a different question: apart from intentional attempts to game the system, are there cases where pages one wants to see are consistently ranked after pages one does not? Which is to say, are there improvements that could be made to Google's ranking algorithms beyond making it more spam resistant?


I thought it was kind of odd they did LDA rather than something more broadly used (e.g. LSA).

But I've never really looked at LDA, and Wikipedia says: "Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics." So maybe they made the right choice. (Not that I see what "solid foundation in statistics" really means in this context.)


LSA is relatively similar in some abstract sense to the published pagerank algorithm. LDA is more powerful, in the sense that it can account for more complex relationships (but may be less accurate with large amounts of data - I have really no idea how those would scale and compare at google-like size).


Can you explain this some more?

My understanding of LDA is that it scores documents against queries based on the topics extracted from the text of the page by the LDA algorithm.

Pagerank, on the other hand, scores based on external pointers (i.e., references) but doesn't have anything to do with the text on the page.


It's the abstract sense that is important. Pagerank is a dimensionality reduction technique. It finds the first eigenvector of the transition matrix. Eigenvectors = PCA. LSI is basically PCA, but applied to the document-term matrix. LDA is a dimensionality reduction technique that makes use of more information.
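The "PageRank is an eigenvector computation" point can be sketched in a few lines. The power iteration below is a toy illustration (the three-node graph and the 0.85 damping factor are invented for the example, and dangling nodes aren't handled): repeatedly pushing rank through the damped link-transition matrix converges to its principal eigenvector.

```python
# Toy PageRank via power iteration: the stationary rank vector is the
# principal eigenvector of the damped link-transition matrix.
# The graph and damping value are illustrative only.

def pagerank(links, damping=0.85, iterations=100):
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Base rank from the "random jump", plus rank flowing in along links.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            share = damping * ranks[node] / len(outlinks)
            for target in outlinks:
                new[target] += share
        ranks = new
    return ranks

# A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

C collects links from both A and B, so it ends up with the highest score, and the scores sum to 1 (they form a probability distribution over pages).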


Oh, I see.

I thought you were talking about some functional similarities, not the mathematical similarities.


For some applications Dirichlet mixtures are even better. For deriving the parameters of the mixtures I used this tool, which I can recommend: http://chasen.org/~daiti-m/dist/dm/


I think "solid foundation in statistics" refers to the fact that LDA has a proper generative model of documents and topics, while LSA doesn't -- so that it's easier to reason about LDA and to build more complex models on top of it.
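That generative story can be shown in a few lines of code. Everything here (the two topics, their word distributions, the document's topic mixture) is invented for illustration; real LDA goes the other way and infers these distributions from a corpus (via Gibbs sampling or variational inference), but the forward process it assumes looks like this:

```python
import random

# Toy version of LDA's generative model (all distributions invented):
# each document has a mixture over topics; each topic is a distribution
# over words. To generate a word: draw a topic from the document's
# mixture, then draw a word from that topic's word distribution.

topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"stock": 0.4, "market": 0.4, "price": 0.2},
}

def generate_document(topic_mixture, length, rng):
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_mixture),
                            weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist),
                                 weights=list(word_dist.values()))[0])
    return words

rng = random.Random(0)
doc = generate_document({"sports": 0.7, "finance": 0.3}, 10, rng)
```

Because there is an explicit probabilistic model, you can write down a likelihood for any observed document, which is what makes LDA easier to reason about and extend than LSA.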


It means there is a likelihood function.


All good ranking functions are pretty correlated. There are many ways for a ranking to be bad, and few ways for it to be good.


This is news? Seriously????

They have found a correlation between a set of words related to a topic you are searching for and how highly a search engine ranks that page?

Well duh! Did anyone really think search engines did a keyword search and then applied Pagerank/HITS (http://en.wikipedia.org/wiki/HITS_algorithm) or whatever? That would give dreadful results.

If you really want to understand this, I recommend Building a Vector Space Search Engine in Perl http://perl.about.com/b/2007/05/24/building-a-vector-space-s...

I built the vector space classifier in http://classifier4j.sf.net based almost entirely on that article, even though I don't know Perl. It's very readable, and gives you a great understanding.
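The core idea of a vector space model fits in a few lines: represent documents and the query as term-count vectors and rank by cosine similarity. This sketch uses invented documents and plain term frequencies (no idf weighting, no stemming), but it's the same general mechanism:

```python
import math
from collections import Counter

# Minimal vector space model: documents and the query become
# term-count vectors; ranking is by cosine similarity, i.e. the
# angle between the query vector and each document vector.

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "cheap graphics card review benchmark",
    "d2": "graphics card driver crash fix",
    "d3": "banana bread recipe",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}
query = Counter("graphics card review".split())
ranking = sorted(vectors, key=lambda d: cosine(query, vectors[d]),
                 reverse=True)
# ranking: d1 (3 shared terms), then d2 (2), then d3 (0)
```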


The news isn't that there is a correlation but that there is such a strong correlation. There are a bunch of specific techniques Google could be using and it looks likely that this is close to what they actually use.

They also use a lot of other ranking factors beyond just the words on the page so seeing such a high correlation from a "bag of words" model is pretty interesting (to me at least).


The correlation really isn't all that high.

If I've read the graph right, it's about 0.33. For a Pearson (product-moment) correlation coefficient, that would mean that about 10% of the variance in Google rankings is explained by a linear regression on LDA scores. They've actually used the Spearman (ranking-based) correlation coefficient, which is equivalent to ranking all the values of each variable from 1..N and then computing the Pearson correlation coefficient for the ranks. So, kinda-sorta with lots of handwaving, that means that about 10% of the ordering of the Google rankings is explained by the LDA scores.
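The description of Spearman above is directly checkable in code: replace each value by its rank, then compute the ordinary Pearson coefficient on the ranks. The data below is invented (and tie-free; ties would need averaged ranks):

```python
import math

# Spearman correlation computed exactly as described: rank each
# variable 1..N, then take the Pearson correlation of the ranks.
# (Illustrative tie-free data; ties would need averaged ranks.)

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

scores_a = [0.2, 0.5, 0.3, 0.9]   # hypothetical model scores
scores_b = [0.1, 0.4, 0.6, 0.8]   # hypothetical second variable
rho = spearman(scores_a, scores_b)  # one swapped pair -> rho = 0.8
```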

Clearly that's a lot better than for the other scoring methods they mentioned, and that probably indicates that Google are doing something a bit like LDA (but this will be true for any approach that takes note of synonyms, and it's hardly news that Google do that). But it doesn't, e.g., suggest that PageRank and other things based on link structure aren't extremely important to Google's rankings.


Google talks about using over 200 ranking factors so I think a correlation this high with a single factor is actually quite interesting. Especially given that we are seeing a correlation with a model that is undoubtedly more naive than what Google is actually using.

It is also fascinating to see this much correlation with an on-page factor which is entirely in the webmaster's control. Previously the highest correlations had been with link metrics.

YMMV but this research is interesting to me as someone who works in this field.


This post should be upvoted a zillion times. The correlations they report are really quite low and as such their claims are really quite bogus.


I don't know if you saw my comment (looks like we posted at around the same time). I'd be interested in your take on it: given that there are a lot of ranking factors, this is quite a high correlation for a single one... (I think).


(Thanks for the reply)

Did you test non-LDA methods? Because to me it looked like a correlation between a set of related words and ranking for a topic related to those words.

Without testing non-LDA methods I can't see what you've proved.


Yes. The chart in the post shows how low the correlation is for tf-idf. I believe the original also showed similarly poor results for LSI etc.
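For reference, tf-idf (the baseline being compared against) is also only a few lines: term frequency scaled down by how many documents contain the term, so words appearing everywhere count for little. The documents below are invented, and this uses the plain log(N/df) variant of idf (several smoothed variants exist):

```python
import math
from collections import Counter

# Minimal tf-idf weighting: term frequency times inverse document
# frequency, using the plain log(N / df) idf variant.
# (Invented documents; real systems usually smooth the idf term.)

docs = [
    "graphics card review",
    "graphics card driver",
    "bread recipe",
]
N = len(docs)
tokenized = [doc.split() for doc in docs]
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_terms):
    tf = Counter(doc_terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tfidf(tokenized[0])
# "review" (in 1 of 3 docs) outweighs "graphics" (in 2 of 3 docs)
```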



Simple cosine similarity is not the same as LDA.


No, of course not. But if you understand how cosine similarity works then you are 90% (99%?) of the way to understanding LSA. I'm not sure about LDA, as I haven't read enough.


I sometimes use LDA (using Hadoop and Mahout) and it is not an inexpensive calculation for large document sets. I wonder what the costs are for using this at large scale.


I'm sorry, but this is overthinking something that is relatively simple to understand.




