YC tries to choose teams, not ideas, so in an ideal world they would fund the same team in 2006 as in 2009. So the idea parts of the application would presumably already be more about how the applicant talks about their idea than about what the idea actually is. Or perhaps one could simply drop the "idea" fields from the data and train only on the "about the team" fields.
Nonetheless, I'd have to see evidence to believe that applications could be classified automatically with any reliability, just because it's such a complex problem. The reason spam, log data, and experimental data are classifiable using tools like this is that they are, by their very nature, relatively predictable. If every spam were written fresh by a different person, selling a completely different product, it would be impossible to filter. Likewise, if every time Apache got a new request or error it made up a little prose on the spot about the topic, it'd be pretty difficult for an automated tool to make any sort of sense out of it.