Publishers like The Guardian and NYT are blocking the IA/Wayback Machine. 20% of...

trollbridge · 2026-02-14T19:45:04 1771098304

And whilst the IA will honour requests not to archive/index, more aggressive scrapers won't, and will disguise their traffic as normal human browser traffic.

So we're basically decided we only want bad actors to be able to scrape, archive, and index.

mananaysiempre · 2026-02-15T15:58:57 1771171137

> If you find yourself wondering, or just feeling, "Why is everyone I wind up dealing with an asshole?" you might want to consider the possibility that you have set up an asshole filter.

https://siderea.dreamwidth.org/1209794.html

trollbridge · 2026-02-18T15:28:04 1771428484

It’s hard to filter out legit looking traffic tho.

JumpCrisscross · 2026-02-14T20:46:14 1771101974

> we're basically decided we only want bad actors to be able to scrape, archive, and index

AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.

fc417fc802 · 2026-02-14T19:47:51 1771098471

Presumably someone has already built this and I'm just unaware of it, but I've long thought some sort of crowd sourced archival effort via browser extension should exist. I'm not sure how such an extension would avoid archiving privileged data though.

ajb · 2026-02-14T21:08:28 1771103308

That exists for court documents (RECAP) but I think they didn't have to solve the issue of privilege as PACER publishes unprivileged docs.

nxobject · 2026-02-15T00:26:27 1771115187

In particular, habeas petitions against DHS, and SSA appeals aren’t available online for public inspection: you have to go to a clerk’s office and pay for physical copies. (I think this may have been reasonable given the circumstances in past decades… not so now.)

nullhole · 2026-02-15T00:01:14 1771113674

Can you give a reference for The Guardian blocking IA? I just checked with an article from today - already archived, and a manual re-archive worked.

ccgreg · 2026-02-15T01:45:06 1771119906

That 20% number is for a limited list of relatively large news websites. If you include the long tail of news, the % of blocking is much smaller.

username223 · 2026-02-15T02:18:14 1771121894

I'm part of that small but (hopefully) growing percentage, because Common Crawl is a deeply dishonest front for AI data scraping. Quoting Wikipedia:

""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """

My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.

ccgreg · 2026-02-15T02:30:55 1771122655

Did you see our reply? https://commoncrawl.org/blog/setting-the-record-straight-com...

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

username223 · 2026-02-15T02:37:53 1771123073

Hopefully my site is no longer part of Common Crawl. I'm not interested in participating in your project, block CCBot in robots.txt, and have requested deletion of my data via your form.

ccgreg · 2026-02-15T02:40:55 1771123255

Did you see our reply? Edit: by which I mean, we sent you an email that explains what we did and how to verify it. Did you not receive an email reply? If not, please contact us again.

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

username223 · 2026-02-15T03:50:16 1771127416

I don't care. Is blocking your bot and requesting removal sufficient? If not, what is?

ccgreg · 2026-02-15T04:09:07 1771128547

Please read our email reply. I have no idea if we received your request —- your HN username doesn’t match any request we have received.

username223 · 2026-02-15T05:58:04 1771135084

"We have initiated the process to remove your content from the Common Crawl Dataset. This is a multi-step process, involving first a nocrawl directive, followed by removal of the URLs from the primary index files, and finally removal of the content from the deep archive. We will advise when the process is complete." Received April 2024. I have not been advised. Please advise.

ccgreg · 2026-02-15T03:38:14 1771126694

Oh, and thanks for letting me know that I need to add our reply to Wikipedia.

samtheDamned · 2026-02-15T18:48:34 1771181314

From my basic experience editing Wikipedia I'm not sure you should edit the page of your own project. Maybe add a discussion for it instead? Or perhaps I'm mistaken.