You could start a search engine with data from Common Crawl; maybe you could even get other projects like archive.is to ship you some hard drives. Then you just need to build an index and serve search queries. Plenty of open source search engines have been attempted, and they provide good templates or even directly usable implementations.
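To make "build an index and serve search queries" concrete, here is a toy sketch of an inverted index in Python. It is not how any particular open source engine does it; the `InvertedIndex` class, the tokenizer, and the document IDs are all made up for illustration, and a real index over a crawl would need ranking, compression, and on-disk storage.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the IDs of documents containing every query term (AND)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            # .get avoids creating empty postings for unknown terms.
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result


index = InvertedIndex()
index.add("a", "Personal blog about gardening")
index.add("b", "Buy cheap gardening tools")
print(index.search("personal gardening"))
```

The query side is just set intersection over postings, which is the part the existing open source engines already handle well.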

The hard part is distinguishing personal blogs from blogspam and other worthless content. Performance is a huge issue, since you want to spend at most double-digit milliseconds per page, but it may be getting viable now that ML is becoming commoditized. Getting this right would make or break the project.
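As a rough illustration of the per-page budget, the sketch below scores a page with two cheap hand-picked signals and reports whether scoring stayed within a time limit. This is a stand-in heuristic, not an ML model; the features (link density, word repetition), the weights, the 0.6 threshold, and the 50 ms default are all assumptions chosen for the example, not tuned values.

```python
import re
import time

WORD_RE = re.compile(r"[a-z0-9']+")
LINK_RE = re.compile(r"<a\s", re.IGNORECASE)


def spam_score(html):
    """Return a score in [0, 1]; higher means more spam-like (assumed features)."""
    words = WORD_RE.findall(html.lower())
    if not words:
        return 1.0  # treat empty pages as worthless
    # Assumed signal 1: many anchor tags relative to the amount of text.
    link_density = min(1.0, 20.0 * len(LINK_RE.findall(html)) / len(words))
    # Assumed signal 2: a small vocabulary repeated many times.
    repetition = 1.0 - len(set(words)) / len(words)
    return 0.5 * link_density + 0.5 * repetition


def classify_within_budget(html, budget_ms=50.0, threshold=0.6):
    """Return (is_spam, score, within_budget) for one page."""
    start = time.perf_counter()
    score = spam_score(html)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return score >= threshold, score, elapsed_ms <= budget_ms


is_spam, score, within_budget = classify_within_budget(
    '<a href="x">buy</a> ' * 10
)
print(is_spam, round(score, 2), within_budget)
```

Anything smarter, such as a small learned classifier, would slot in behind `spam_score` and has to fit the same budget, which is where the real difficulty lies.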