This is a good layman's introduction to modern search techniques, but to someone not in the SEO field it feels like a very strange inversion of priorities. To me, like most people, the surprise is how effective techniques like LDA[1] can be in characterizing a document, but the 'surprise' in the article is that LDA correlates to Google search order better than a more simplistic model.
To a technologically savvy but naive outsider, this might seem obvious: shouldn't pages that rank highly in Google have strong topic-based correlation to pages that the user wants to see? But from the SEO perspective, I guess the conclusion would be that your page is more likely to be ranked highly if it includes all the trappings of other high ranked pages, with, you know, like synonyms and stuff. At a certain point, one has to start thinking, wouldn't it be simpler to make a page that people actually want to find?
Are there good examples of actually useful pages that Google doesn't do a good job of ranking? I occasionally find myself lately getting frustrated with Google about ignoring my rarer search terms, but generally I find the good pages are at the top if they exist at all.
[1] LDA is Latent Dirichlet Allocation, which is very similar to Latent Semantic Analysis, which in turn is very similar to Principal Component Analysis and Singular Value Decomposition. So it's possible you've already heard of the concept, but coming from another angle in another field.
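To make the LSA/SVD connection concrete, here's a minimal sketch (my own toy term-document matrix, not anything from the article): LSA is essentially a truncated SVD of the count matrix, which gives each document low-dimensional "topic" coordinates.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, cols = documents).
# Docs 0-1 share "cat"/"pet" vocabulary; doc 2 is about "stock"/"market".
X = np.array([
    [2, 1, 0],  # "cat"
    [1, 2, 0],  # "pet"
    [0, 0, 3],  # "stock"
    [0, 1, 2],  # "market"
], dtype=float)

# LSA: keep only the top-k singular directions ("topics").
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Docs 0 and 1 land closer to each other than either does to doc 2.
print(cos(doc_coords[0], doc_coords[1]) > cos(doc_coords[0], doc_coords[2]))
```

LDA gets at a similar low-dimensional representation, but via a probabilistic topic model rather than linear algebra.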
> Are there good examples of actually useful pages that Google doesn't do a good job of ranking?
Yes. Reviews by actual people.
A long, long time ago (last year I think) I used to use google blogsearch to get at the "other side" of hardware reviews: what people who are not being paid to review hardware think of a specific piece of hardware.
Doesn't work any more. There's almost nothing but spam. It's hopeless.
I agree. There are definitely cases where spam has displaced good results. Although poorly phrased, I guess I was asking a different question: apart from intentional attempts to game the system, are there cases where pages one wants to see are consistently ranked after pages one does not? Which is to say, are there improvements that could be made to Google's ranking algorithms beyond making it more spam resistant?
I thought it was kind of odd they did LDA rather than something more broadly used (eg LSA).
But I've never really looked at LDA, and Wikipedia says: "Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics." So maybe they made the right choice. (Not that I see what "solid foundation in statistics" really means in this context.)
LSA is relatively similar in some abstract sense to the published PageRank algorithm. LDA is more powerful, in the sense that it can account for more complex relationships (but may be less accurate with large amounts of data; I really have no idea how those would scale and compare at Google-like sizes).
My understanding of LDA is that it gives you document scores against queries based on the topics extracted using the LDA algorithm on the text in the page.
PageRank, on the other hand, scores based on external pointers (ie, references) but doesn't have anything to do with the text on the page.
It's the abstract sense that is important. Pagerank is a dimensionality reduction technique. It finds the first eigenvector of the transition matrix. Eigenvectors = PCA. LSI is basically PCA, but applied to the document-term matrix. LDA is a dimensionality reduction technique that makes use of more information.
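To illustrate the "first eigenvector of the transition matrix" point, here's a toy power-iteration sketch of PageRank (the three-page link graph is my own made-up example, not from the article):

```python
import numpy as np

# Toy link graph: adjacency[i][j] = 1 means page i links to page j.
adjacency = np.array([
    [0, 1, 1],  # page 0 links to pages 1 and 2
    [0, 0, 1],  # page 1 links to page 2
    [1, 0, 0],  # page 2 links back to page 0
], dtype=float)

# Column-stochastic transition matrix with damping (standard PageRank form).
d = 0.85
n = adjacency.shape[0]
out_degree = adjacency.sum(axis=1, keepdims=True)
M = (adjacency / out_degree).T        # M[j, i] = P(follow link i -> j)
G = d * M + (1 - d) / n               # damped "Google matrix"

# Power iteration converges to the principal eigenvector of G.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = G @ rank

print(rank)  # page 2, which both other pages link to, scores highest
```

That eigenvector-extraction step is the dimensionality-reduction analogy: PageRank, LSA/PCA, and LDA all summarize a big matrix with a few dominant directions, just computed from different matrices.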
For some applications Dirichlet mixtures are even better. I used this tool, which I can recommend, for deriving the parameters of the mixtures:
http://chasen.org/~daiti-m/dist/dm/
I think "solid foundation in statistics" refers to the fact that LDA has a proper generative model of documents and topics, while LSA doesn't -- so that it's easier to reason about LDA and to build more complex models on top of it.
They have found a correlation between a set of words related to a topic you are searching for and how highly a search engine ranks that page?
Well duh! Did anyone really think search engines did a keyword search and then applied Pagerank/HITS (http://en.wikipedia.org/wiki/HITS_algorithm) or whatever? That would give dreadful results.
I built the vector space classifier in http://classifier4j.sf.net based almost entirely on that article, even though I don't know Perl. It's very readable, and gives you a great understanding.
The news isn't that there is a correlation but that there is such a strong correlation. There are a bunch of specific techniques Google could be using and it looks likely that this is close to what they actually use.
They also use a lot of other ranking factors beyond just the words on the page so seeing such a high correlation from a "bag of words" model is pretty interesting (to me at least).
If I've read the graph right, it's about 0.33. For a Pearson (product-moment) correlation coefficient, that would mean that about 10% of the variance in Google rankings is explained by a linear regression on LDA scores. They've actually used the Spearman (ranking-based) correlation coefficient, which is equivalent to ranking all the values of each variable from 1..N and then computing the Pearson correlation coefficient for the ranks. So, kinda-sorta with lots of handwaving, that means that about 10% of the ordering of the Google rankings is explained by the LDA scores.
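As a sanity check on the "Spearman = Pearson on ranks" claim, here's a small sketch with made-up scores for six pages (not the article's data): rank both variables 1..N, then compute an ordinary Pearson coefficient on the ranks.

```python
import numpy as np

def ranks(x):
    # Rank values 1..N (no ties in this toy data).
    order = np.argsort(x)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(x) + 1)
    return r

def pearson(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

# Made-up LDA scores vs. Google positions (higher = ranked better).
lda_score = np.array([0.9, 0.2, 0.75, 0.4, 0.55, 0.1])
google_position_score = np.array([6, 1, 4, 3, 5, 2])

spearman = pearson(ranks(lda_score).astype(float),
                   ranks(google_position_score).astype(float))
print(round(spearman, 3))
```

With a coefficient of about 0.33 as in the article's chart, squaring it (roughly 0.11) is what gives the "about 10% of the variance" reading above.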
Clearly that's a lot better than for the other scoring methods they mentioned, and that probably indicates that Google are doing something a bit like LDA (but this will be true for any approach that takes note of synonyms, and it's hardly news that Google do that). But it doesn't, e.g., suggest that PageRank and other things based on link structure aren't extremely important to Google's rankings.
Google talks about using over 200 ranking factors so I think a correlation this high with a single factor is actually quite interesting. Especially given that we are seeing a correlation with a model that is undoubtedly more naive than what Google is actually using.
It is also fascinating to see this much correlation with an on-page factor which is entirely in the webmaster's control. Previously the highest correlations had been with link metrics.
YMMV but this research is interesting to me as someone who works in this field.
I don't know if you saw my comment (looks like we posted at similar times). I'd be interested in your thoughts relative to that: given there are a lot of factors, this is quite a high correlation for a single one... (I think).
Did you test non-LDA methods? Because to me it looked like a correlation between a set of related words and ranking for a topic related to those words.
Without testing non-LDA methods I can't see what you've proved.
No, of course not. But if you understand how cosine similarity works then you are 90% (99%?) of the way to understanding LSA. I'm not sure about LDA, as I haven't read enough.
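For reference, cosine similarity over bag-of-words vectors — the core of the vector-space model that LSA builds on — is just this (toy vocabulary and counts of my own):

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 for parallel vectors, 0.0 for orthogonal.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy term-count vectors over the vocabulary ["cat", "dog", "stock"].
doc_pets   = np.array([3.0, 2.0, 0.0])
doc_market = np.array([0.0, 0.0, 4.0])
query      = np.array([1.0, 1.0, 0.0])

print(cosine_similarity(query, doc_pets))    # high: shared vocabulary
print(cosine_similarity(query, doc_market))  # 0.0: no terms in common
```

LSA's extra step is to apply this same similarity in the reduced SVD space instead of raw term space, which is what lets it pick up synonym overlap that exact term matching misses.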
I sometimes use LDA (using Hadoop and Mahout) and it is not an inexpensive calculation for large document sets. I wonder what the costs are of using this at large scale.