
> But the truth is that all kinds of things disappear all the time in all aspects of life. The web is no different at all.

Glad we have the Wayback Machine then. But if you don't want your blog mirrored by Wayback you can declare that in your `robots.txt` file. Do this:

    User-agent: *
    Disallow: /
But that doesn't mean crawlers/bots will honor the request, so presume that any content you post publicly will be backed up somewhere. If not somewhere on the net, then on someone's hard drive!
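For illustration, here is how a well-behaved crawler evaluates those rules, sketched with Python's standard-library `urllib.robotparser`. The rules string mirrors the Disallow-everything snippet above; `ia_archiver` is the user-agent token historically associated with the Internet Archive's crawler (any token would behave the same under `User-agent: *`):

```python
import urllib.robotparser

# The same blanket-exclusion rules as in the robots.txt above.
rules = """\
User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# With "Disallow: /" under "User-agent: *", every URL on the site
# is off-limits to every crawler that honors robots.txt.
print(rp.can_fetch("ia_archiver", "https://example.com/blog/post"))  # False
```

The point of the thread stands, though: this check is voluntary, and nothing stops a crawler from skipping it entirely.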


The Internet Archive does not respect robots.txt - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


The blog post you are linking to is outdated. They now honor robots.txt files. From the FAQ:

> Some sites are not available because of robots.txt or other exclusions. What does that mean? Such sites may have been excluded from the Wayback Machine due to a robots.txt file on the site or at a site owner’s direct request.

If you exclude them in your robots.txt file, they will also retroactively remove your site from the index.

- https://news.ycombinator.com/item?id=16965575

- https://help.archive.org/help/using-the-wayback-machine/


I would absolutely love an option that meant "archive and make available forever from this point backwards" to protect against domain expirations and re-registration (possibly by domain squatters or content farms).


I hope you're right! The lack of an update on that post, combined with the FAQ saying the opposite thing, makes it even harder for me to know what their policy is. Respecting robots.txt is a civilized thing to do and I hope they do it.


I hope they don't. If you don't want things archived, don't put them out there.



