How are you dealing with the fact that Common Crawl updates its data much less regularly than commercial search engines? And that each update is only a partial refresh?
Edit: And I will say your site design is very nice.
Thank you! We didn't plan to update the index regularly.
But since it takes only 24 hours to index 1B pages, the easiest approach would be to reindex everything, upload it to S3, and update the metadata so the search engine queries the right segments.
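To illustrate what that metadata flip could look like, here's a minimal sketch in Python, assuming a manifest-plus-pointer layout on S3 (the bucket, keys, and function here are hypothetical, not our actual scheme): the fresh segment list is written under a new versioned key, then a small pointer object is overwritten last so the engine switches to the new segments in one step.

```python
# Minimal sketch of the "reindex, upload, flip metadata" flow.
# Bucket name, key layout, and the pointer-file convention are
# all assumptions for illustration, not the project's real scheme.
import json
import time

import boto3

s3 = boto3.client("s3")

BUCKET = "example-search-index"      # hypothetical bucket
POINTER_KEY = "index/current.json"   # small file the engine reads at query time


def publish_index(segment_keys: list[str]) -> None:
    """Upload a versioned manifest, then repoint the engine at it."""
    version = time.strftime("%Y%m%d%H%M%S")
    manifest_key = f"index/manifests/{version}.json"

    # 1. Write the full segment list under a new, versioned key.
    s3.put_object(
        Bucket=BUCKET,
        Key=manifest_key,
        Body=json.dumps({"version": version, "segments": segment_keys}),
        ContentType="application/json",
    )

    # 2. Overwrite the tiny pointer object last, so readers switch
    #    from the old index to the new one in a single step.
    s3.put_object(
        Bucket=BUCKET,
        Key=POINTER_KEY,
        Body=json.dumps({"manifest": manifest_key}),
        ContentType="application/json",
    )
```

Writing the pointer last keeps the swap effectively atomic from the reader's side: queries either see the old manifest or the new one, never a half-uploaded index.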
Ah, I understand: you're showcasing the methodology for the underlying index, but you're going to open-source the engine. I see, great stuff then, super novel, and honestly the rest of the open-source search engines can definitely use some competition. Love it!