
How are you dealing with the fact that Common Crawl updates its data much less regularly than commercial search engines? And that each update is only a partial refresh?

Edit: And I will say your site design is very nice.



Thank you! We don't plan to regularly update the index. But since it takes only 24 hours to index 1B pages, the easiest approach would be to reindex everything, upload it to S3, and update the metadata so the search engine queries the right segments.
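
To make that "reindex, upload, repoint" flow concrete, here is a minimal sketch assuming boto3 and hypothetical bucket and key names (nothing here is from the actual project):

    # Minimal sketch of a full-reindex swap: upload new segments under a
    # versioned prefix, then repoint a metadata file the query layer reads.
    # Bucket, prefixes, and file layout are all assumptions for illustration.
    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "search-index"  # hypothetical bucket name

    def publish_index(version: str, segment_files: list[str]) -> None:
        # Upload each freshly built segment file under a versioned prefix.
        for path in segment_files:
            s3.upload_file(path, BUCKET, f"indexes/{version}/{path}")
        # Then overwrite the single metadata object the search engine
        # consults, so queries cut over to the new segments all at once.
        metadata = {
            "version": version,
            "segments": [f"indexes/{version}/{p}" for p in segment_files],
        }
        s3.put_object(
            Bucket=BUCKET,
            Key="metadata/current.json",
            Body=json.dumps(metadata).encode(),
        )

Writing the metadata pointer last means the old segments stay live until the new ones are fully uploaded, so the cutover is effectively atomic from the query side.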


We indexed Common Crawl only for the purpose of this demo, so this is a one-time thing; we won't deal with updates.


Ah, I understand: you're showcasing the methodology with the underlying index, but you're going to open source the engine itself. Great stuff then, super novel, and honestly the rest of the open source search engines can definitely use some competition. Love it!



