The Octopart team has done a great job with HNSearch, and we really appreciate the huge favor they've done us by providing it. That said, due to limitations on our end in how they integrate with us, they're not able to offer real-time updates or full-fidelity ranking snapshots.
I'm working on a more comprehensive first-party API for HN, and plan to implement the following, in this order:
1) Near-real-time profiles, comments, and stories as JSON.
2) Real-time streaming of profile and item changes.
3) Near-real-time ranking of comments and stories.
4) Real-time streaming of ranking changes.
5) History of ranking changes.
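As a rough illustration of how (1) might look from a consumer's side, here's a minimal Python sketch; the endpoint URL and field names are placeholders of my own, not a published spec:

    import json
    import urllib.request

    # Hypothetical endpoint -- the real URL scheme hasn't been published yet.
    ITEM_URL = "https://hn-api.example.com/v0/item/{id}.json"

    def fetch_item(item_id):
        """Fetch a single story or comment as JSON (field names are guesses)."""
        with urllib.request.urlopen(ITEM_URL.format(id=item_id)) as resp:
            return json.load(resp)

    item = fetch_item(8863)
    print(item.get("title"), item.get("time"))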
Sadly, I can't commit to any firm timeline for future progress right now, but know that I'm working on it :)
--
Edit: Removed link to broken data file. Will fix it up tomorrow.
Thanks for the info, and I do look forward to the new API!
However, my question still remains with regard to historic posts/comments. The historic aspect is really the important element here. Generally speaking, building an ngram viewer requires a collection of texts over time, with each text having some kind of metadata that is categorical, boolean, datetime, or numeric. Categorical data can always be derived from numeric data by creating bins -- e.g., grouping posts by the author's karma or ranking at the time the comment/post was created into 1-50, 51-150, 151-300, etc. Datetimes can also be turned into categorical variables that are useful for an ngram viewer, such as day of the week (to spot weekly seasonality trends) or day of the year (annual seasonality trends).
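To make that concrete, here's a small sketch of the kind of feature extraction I have in mind; the bin edges come from the example above and the function names are just illustrative:

    from datetime import datetime, timezone

    # Karma ranges taken from the example above -- extend as needed.
    KARMA_BINS = [(1, 50), (51, 150), (151, 300)]

    def karma_bucket(karma):
        """Turn a numeric karma value into a categorical bin label."""
        for low, high in KARMA_BINS:
            if low <= karma <= high:
                return f"{low}-{high}"
        return "300+"

    def time_features(unix_time):
        """Turn a comment's creation time into categorical seasonality features."""
        dt = datetime.fromtimestamp(unix_time, tz=timezone.utc)
        return {
            "day_of_week": dt.strftime("%A"),       # weekly seasonality
            "day_of_year": dt.timetuple().tm_yday,  # annual seasonality
        }

    print(karma_bucket(120))          # -> "51-150"
    print(time_features(1391000000))  # -> {'day_of_week': 'Wednesday', 'day_of_year': 29}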
If I were allowed, I would be willing to write a scraper/crawler to discover as many historic threads (since: threads -> comments) as possible using HNSearch, but this could take a long time depending on rate limits and/or be subject to unknown biases in my discovery method. I'm sure you can understand why a "top-down" approach like a database dump would make for a much higher-quality corpus than attempting the "bottom-up" approach of a crawler. I have no idea whether a "database dump of everything" is even feasible, as I don't know anything about HN's backend infrastructure. However, if it is feasible, then I'm certain I can work with whatever is available. Adding structure to unstructured data is my bread and butter.
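For what it's worth, the "bottom-up" crawl would look something like the sketch below; the search endpoint, query parameters, and response shape are assumptions on my part, not HNSearch's actual API, and any real crawl would respect whatever rate limits apply:

    import json
    import time
    import urllib.parse
    import urllib.request

    # Assumed HNSearch-style endpoint -- the real API may differ.
    SEARCH_URL = "https://hnsearch.example.com/items/_search?{query}"

    def discover_threads(pages=10, per_page=100, delay=2.0):
        """Bottom-up discovery: page through search results to collect thread ids."""
        thread_ids = set()
        for page in range(pages):
            query = urllib.parse.urlencode({
                "type": "submission",
                "limit": per_page,
                "start": page * per_page,
            })
            with urllib.request.urlopen(SEARCH_URL.format(query=query)) as resp:
                results = json.load(resp)
            for hit in results.get("results", []):
                thread_ids.add(hit["id"])
            time.sleep(delay)  # crude politeness / rate limiting
        return thread_ids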
I really think this would be a very cool tool that a lot of people would enjoy, so I'm willing to do whatever is needed on my end to help make it work. After all, I'd be on the clock while working on this rather than treating it as just a hobby project, so the incentives are definitely aligned on my end.
If you want to discuss anything in private, I can be reached at the following reversed address: moc{dot}liamg{at}yalkcin{dot}wehttam