Ask PG: Could YC admissions be replaced with a very small shell script?
17 points by dfranke on May 26, 2009 | 12 comments
If you train CRM114 (or some other classifier) on all but the most recent batch of YC applications, how does it fare at distinguishing interview vs. no-interview applications in the most recent batch? How well would it have to do in order to be alarming? :-)

There's no particular reason that I'm asking this other than that I just finished retraining CRM to take advantage of the version upgrade and was pretty impressed with the training results.
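
Concretely, I mean something like the sketch below. scikit-learn's naive Bayes stands in for CRM114 here, and the toy data is invented; the real experiment would feed in the actual application text and interview decisions.

    # Sketch: train on earlier batches, test on the most recent one.
    # MultinomialNB stands in for CRM114; the data below is toy data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = [("we wrote our own Lisp compiler", 1),
             ("MBA seeking technical cofounder", 0)]            # earlier batches
    test = [("self-taught hacker, shipped three products", 1)]  # latest batch

    vec = TfidfVectorizer()
    X_train = vec.fit_transform([text for text, _ in train])
    X_test = vec.transform([text for text, _ in test])

    clf = MultinomialNB().fit(X_train, [label for _, label in train])
    print(clf.predict(X_test))  # how often does this match the real decisions?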



We did try this once on startup school applications (after the fact) and it was a pretty bad predictor.

If we ran code like this on YC applications, I suspect the most useful way to use it would be to find groups we hadn't been planning to invite to interviews that deserved a second look.


Huh. For startup school applications that's actually a surprising result. I was under the impression that the admissions process on that was basically just to weed out the hackers from the business people, and with an application that brief I'm surprised that can't be accomplished with a keyword search.
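
Something like this is all I mean; the keyword list is purely illustrative:

    # Toy keyword filter for "does this read like a hacker's application?".
    # The word list is illustrative, not a real criterion.
    HACKER_WORDS = {"lisp", "erlang", "compiler", "kernel", "haskell", "shipped"}

    def looks_like_hacker(application_text):
        words = set(application_text.lower().split())
        return len(words & HACKER_WORDS) >= 2

    print(looks_like_hacker("We wrote a Lisp compiler"))  # True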


You mean I changed my name to Lisp MacErlang for nothing??

But seriously, the Startup School application is basically just a list of keywords, at least the way I understood it. Are there really people writing essays in there?


That's not the way I understood it. You can't sell your idea, or your team, with a list of keywords.

(Anybody who actually got accepted want to back me up? :-) )


We're talking about startup school, not funding.


Oh... sorry, good point. :-)


I imagine that YC apps reflect the startup trends at the moment. An online video company that quite possibly would have been funded in 2006 would be dead in the water today. With six months between applications, and not that much historical data, this could be hard to compensate for.


YC tries to choose teams, not ideas, so in an ideal world they would fund the same team in 2006 as in 2009. It would seem, then, that the idea parts of the application matter more for how the applicants talk about their idea than for what the idea actually is. Or perhaps one could simply drop the "idea" fields from the data and train only on the "about the team" fields, as sketched below.
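
If the applications are stored as structured records, that's nearly a one-liner; a sketch, with invented field names:

    # Keep only the "about the team" fields before training.
    # These field names are invented; the real form differs.
    TEAM_FIELDS = ("founders_backgrounds", "impressive_things_built",
                   "how_long_have_you_known_each_other")

    def team_text(application):
        return " ".join(application.get(field, "") for field in TEAM_FIELDS)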

Nonetheless, I'd have to see evidence to believe that applications could be classified automatically with any reliability, just because it's such a complex problem. The reason spam, log data, and experimental data are classifiable with tools like this is that they are, by their very nature, relatively predictable. If every spam were written fresh by a different person, selling a completely different product, it would be impossible to filter. Likewise, if every time Apache got a new request or error it made up a little prose on the spot about the topic, it'd be pretty difficult for an automated tool to make any sort of sense of it.


You could open up the dataset by assigning every word an id and releasing an anonymized version of the applications in that coded form.
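
A minimal sketch of that encoding, assuming whitespace tokenization is good enough:

    # Replace each distinct word with a numeric id, consistently across
    # the dataset: word order and frequencies survive, the words don't.
    from collections import defaultdict
    from itertools import count

    word_ids = defaultdict(count(1).__next__)

    def encode(text):
        return " ".join(str(word_ids[w]) for w in text.lower().split())

    print(encode("we built a compiler"))    # "1 2 3 4"
    print(encode("we built a better one"))  # "1 2 3 5 6"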


You could probably extract a good deal of information from this by examining word frequency and sentence structure. Or at least, the attempt would be too much for me to resist.
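
For instance, assuming the ids preserve frequency, the most common ids almost have to map to the most common English words, so you could start with something like:

    # Guess the mapping by frequency: the most common ids in the coded
    # corpus probably line up with the most common English words.
    from collections import Counter

    COMMON_WORDS = ["the", "to", "a", "and", "of", "we", "in", "is"]

    def guess_mapping(coded_corpus):
        freq = Counter(coded_corpus.split())
        return {wid: word for (wid, _), word
                in zip(freq.most_common(len(COMMON_WORDS)), COMMON_WORDS)}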


You'd lose enough to make the risk of competition all but gone. The most meaningful words might appear only three times in a 1,000-word corpus; you just can't glean meaning from so little context.

I'd love to try too.


I've applied twice to YC, so my own applications would provide a Rosetta stone for a lot of important words. But that would be cheating.



