We did try this once on Startup School applications (after the fact) and it was a pretty bad predictor.
If we ran code like this on YC applications, I suspect the most useful way to use it would be to find groups we hadn't been planning to invite to interviews that deserved a second look.
Huh. For Startup School applications that's actually a surprising result. I was under the impression that the admissions process there was basically just to weed the hackers out from the business people, and with an application that brief I'm surprised that can't be accomplished with a keyword search.
You mean I changed my name to Lisp MacErlang for nothing??
But seriously, the Startup School application is basically just a list of keywords, at least the way I understood it. Are there really people writing essays in there?
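For what it's worth, the sort of keyword screen I have in mind is trivially simple. This is only a sketch; the keyword list and the idea that an application arrives as one blob of text are my own assumptions, not anything about the actual form.

```python
# Hypothetical keyword screen: flag applications that mention technical
# terms, as a crude hacker-vs-businessperson filter. The keyword list
# is invented for illustration; substring matching is deliberately naive.
TECH_KEYWORDS = {"python", "lisp", "compiler", "linux",
                 "database", "erlang", "open source"}

def looks_technical(application_text):
    """Return True if the application mentions any technical keyword."""
    text = application_text.lower()
    return any(kw in text for kw in TECH_KEYWORDS)

print(looks_technical("We built a Lisp compiler in Erlang."))
print(looks_technical("We have an MBA and a great business plan."))
```

Obviously someone named Lisp MacErlang games this instantly, which is sort of the point: keywords catch vocabulary, not ability.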
I imagine that YC apps reflect the startup trends at the moment. An online video company that quite possibly would have been funded in 2006 would be dead in the water today. With six months between applications, and not that much historical data, this could be hard to compensate for.
YC tries to choose teams, not ideas, so in an ideal world they would fund the same team in 2006 as in 2009. It would seem, then, that the predictive signal in the idea parts of the application would be more about how the applicant talks about their idea than about what the idea actually is. Or perhaps one could simply drop the "idea" fields from the data and train only on the "about the team" fields.
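Restricting the classifier to the team fields is easy to sketch. Here's a minimal naive-Bayes-style scorer along those lines; the field names ("founders", "impressive") and the training data format are invented for illustration, since I have no idea what the real application schema looks like.

```python
# Sketch: score applications using only the "team" fields, dropping the
# idea-specific ones. Field names and data layout are hypothetical.
import math
from collections import Counter

TEAM_FIELDS = ("founders", "impressive")

def train(labeled_apps, fields=TEAM_FIELDS):
    """labeled_apps: list of (app_dict, accepted_bool) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for app, accepted in labeled_apps:
        text = " ".join(app.get(f, "") for f in fields)
        for word in text.lower().split():
            counts[accepted][word] += 1
            totals[accepted] += 1
    return counts, totals

def score(counts, totals, app, fields=TEAM_FIELDS):
    """Log-likelihood ratio of acceptance, with add-one smoothing."""
    text = " ".join(app.get(f, "") for f in fields)
    vocab = len(set(counts[True]) | set(counts[False])) + 1
    llr = 0.0
    for word in text.lower().split():
        p_yes = (counts[True][word] + 1) / (totals[True] + vocab)
        p_no = (counts[False][word] + 1) / (totals[False] + vocab)
        llr += math.log(p_yes / p_no)
    return llr  # positive leans toward "worth a second look"
```

Note this ignores the idea fields entirely by construction, so a 2006 video startup and a 2009 one with the same team would score identically.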
Nonetheless, I'd have to see evidence to believe that applications could be classified automatically with any reliability, just because it's such a complex problem. The reason spam, log data, and experimental data are classifiable using tools like this is that they are, by their very nature, relatively predictable. If every spam were written fresh by a different person, selling a completely different product, it would be impossible to filter. Likewise, if every time Apache got a new request or error it made up a little prose on the spot about the topic, it'd be pretty difficult for an automated tool to make any sort of sense out of it.
You could probably extract a good deal of information from this by examining word frequency and sentence structure. Or at least, the attempt would be too much for me to resist.
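The attempt really is only a few lines. A minimal sketch of the word-frequency and sentence-structure features I mean, with an invented sample text and deliberately naive sentence splitting:

```python
# Toy feature extraction: word frequencies plus average sentence length.
# The sample text is made up; splitting on .!? is intentionally crude.
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word occurrences in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def avg_sentence_length(text):
    """Mean words per sentence, splitting naively on .!? punctuation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "We build tools. We love tools! Tools are everything."
print(word_frequencies(sample).most_common(2))  # [('tools', 3), ('we', 2)]
print(avg_sentence_length(sample))              # 3.0
```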
I doubt you'd extract enough to make it worthwhile. The most meaningful sentences might use a given word 3 times in a 1,000-word corpus. You just can't glean meaning from so little context.