We did try this once on Startup School applications (after the fact) and it was a pretty bad predictor.
If we ran code like this on YC applications, I suspect the most useful way to use it would be to find groups we hadn't been planning to invite to interviews that deserved a second look.
Huh. For Startup School applications that's actually a surprising result. I was under the impression that the admissions process there was basically just to weed the hackers out from the business people, and with an application that brief I'm surprised that can't be accomplished with a keyword search.
You mean I changed my name to Lisp MacErlang for nothing??
But seriously, the Startup School application is basically just a list of keywords, at least the way I understood it. Are there really people writing essays in there?
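For what it's worth, the sort of keyword screen I have in mind is trivially simple. This is only a sketch; the keyword list and the idea that an application arrives as one blob of text are my own assumptions, not anything about the actual form.

```python
# Hypothetical keyword screen: flag applications that mention technical
# terms, as a crude hacker-vs-businessperson filter. The keyword list
# is invented for illustration; substring matching is deliberately naive.
TECH_KEYWORDS = {"python", "lisp", "compiler", "linux",
                 "database", "erlang", "open source"}

def looks_technical(application_text):
    """Return True if the application mentions any technical keyword."""
    text = application_text.lower()
    return any(kw in text for kw in TECH_KEYWORDS)

print(looks_technical("We built a Lisp compiler in Erlang."))
print(looks_technical("We have an MBA and a great business plan."))
```

Obviously someone named Lisp MacErlang games this instantly, which is sort of the point: keywords catch vocabulary, not ability.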
I imagine that YC apps reflect the startup trends at the moment. An online video company that quite possibly would have been funded in 2006 would be dead in the water today. With six months between applications, and not that much historical data, this could be hard to compensate for.
YC tries to choose teams, not ideas, so in an ideal world they would fund the same team in 2006 as in 2009. It would seem, then, that the predictive signal in the idea parts of the application would be more about how the applicant talks about their idea than about what the idea actually is. Or perhaps one could simply drop the "idea" fields from the data and train only on the "about the team" fields.
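Restricting the classifier to the team fields is easy to sketch. Here's a minimal naive-Bayes-style scorer along those lines; the field names ("founders", "impressive") and the training data format are invented for illustration, since I have no idea what the real application schema looks like.

```python
# Sketch: score applications using only the "team" fields, dropping the
# idea-specific ones. Field names and data layout are hypothetical.
import math
from collections import Counter

TEAM_FIELDS = ("founders", "impressive")

def train(labeled_apps, fields=TEAM_FIELDS):
    """labeled_apps: list of (app_dict, accepted_bool) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for app, accepted in labeled_apps:
        text = " ".join(app.get(f, "") for f in fields)
        for word in text.lower().split():
            counts[accepted][word] += 1
            totals[accepted] += 1
    return counts, totals

def score(counts, totals, app, fields=TEAM_FIELDS):
    """Log-likelihood ratio of acceptance, with add-one smoothing."""
    text = " ".join(app.get(f, "") for f in fields)
    vocab = len(set(counts[True]) | set(counts[False])) + 1
    llr = 0.0
    for word in text.lower().split():
        p_yes = (counts[True][word] + 1) / (totals[True] + vocab)
        p_no = (counts[False][word] + 1) / (totals[False] + vocab)
        llr += math.log(p_yes / p_no)
    return llr  # positive leans toward "worth a second look"
```

Note this ignores the idea fields entirely by construction, so a 2006 video startup and a 2009 one with the same team would score identically.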
Nonetheless, I'd have to see evidence to believe that applications could be classified automatically with any reliability, just because it's such a complex problem. The reason spam, log data, and experimental data are classifiable using tools like this is that they are, by their very nature, relatively predictable. If every spam were written fresh by a different person, selling a completely different product, it would be impossible to filter. Likewise, if every time Apache got a new request or error it made up a little prose on the spot about the topic, it'd be pretty difficult for an automated tool to make any sort of sense out of it.
You could probably extract a good deal of information from this by examining word frequency and sentence structure. Or at least, the attempt would be too much for me to resist.
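The attempt really is only a few lines. A minimal sketch of the word-frequency and sentence-structure features I mean, with an invented sample text and deliberately naive sentence splitting:

```python
# Toy feature extraction: word frequencies plus average sentence length.
# The sample text is made up; splitting on .!? is intentionally crude.
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word occurrences in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def avg_sentence_length(text):
    """Mean words per sentence, splitting naively on .!? punctuation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "We build tools. We love tools! Tools are everything."
print(word_frequencies(sample).most_common(2))  # [('tools', 3), ('we', 2)]
print(avg_sentence_length(sample))              # 3.0
```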
I doubt you'd extract enough to make it worthwhile. The most meaningful sentences might use a given word 3 times in a 1,000-word corpus. You just can't glean meaning from so little context.