Hacker News | walls's comments

A huge amount of the web is only crawlable with a googlebot user-agent and specific source IPs.
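
As a rough illustration of the user-agent half of that claim, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `NewCrawler` agent name, and the example.com URL are all hypothetical. Source-IP allowlisting happens at the server or CDN and is not visible in robots.txt, so it is not modeled here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that welcomes Googlebot and nobody else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    for agent in ("Googlebot", "NewCrawler"):
        print(agent, allowed(agent, page))
```

A well-behaved new crawler that honors rules like these is locked out before it fetches a single page, which is the asymmetry being described.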


> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.

I have always wondered how the Wayback Machine works. Is there no way we could take the Wayback archive and run an index on top of it somehow?


You can read https://hackernoon.com/the-long-now-of-the-web-inside-the-in... it was a nice look into their infrastructure. One could theoretically build it. A few things stand out:

1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler.

2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy because there literally is no other place to go. I don't think they can use Google's syndicated API, because they would then start showing ads in their database, which is garbage that would strain their tiny storage budget.

3. Minor from a software engineering perspective but important for the survival of the company: since they are an archive of record, converting that into an index would need a good legal team to argue the case against Google. They do have the DoJ's recent ruling in their favor.


I do not know a lot about this subject, but couldn’t you make a pretty decent index off of Common Crawl? It seems to me the bar is so low you wouldn’t have to have everything. Especially if your goal was not monetization with ads.


I think someone commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start, but I think the key to a good index people will use is freshness of the results. You need good recall for a search engine; precision tuning/re-ranking is not going to help otherwise.
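
A toy sketch of the recall point, with made-up document IDs and a hypothetical "perfect" re-ranker: re-ranking only reorders what the index already holds, so it cannot raise recall.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear in retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical data: four relevant pages exist, the small index holds one.
relevant = {"a", "b", "c", "d"}
small_index = ["x", "a", "y"]

# The engine can only return documents that are in its index.
results = list(small_index)

# An idealized re-ranker that moves relevant documents to the front.
reranked = sorted(results, key=lambda doc: doc not in relevant)

if __name__ == "__main__":
    print(recall(results, relevant), recall(reranked, relevant))
```

Both calls give the same recall; only the order of the results changes.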


Are these websites not serving public content? If there are legal concerns, just create a separate scraping LLC that fakes the user agent and uses residential IPs or a VPN or something. I can't imagine that the companies would follow through with some sort of lawsuit against a scraper that's trying to index their site to get them more visitors, if they allow GoogleBot.


Isn't that what SerpAPI was doing?


If a crawler offered enough money they could be allowed too. It's not like Google has exclusive crawling rights.


There is a logistics problem here - even if you had enough money to pay, how would you get in touch with every single site to even let them know you're happy to pay? It's not like site operators routinely scan their error logs to see your failed crawling attempts and your offer in the user-agent.

Even if they see it, it's a classic chicken & egg problem: it's not worth the time of the site operator to engage with your offer until your search engine is popular enough to matter, but your search engine will never become popular enough to matter if it doesn't have a critical mass of sites to begin with.


Realistically you don't need every single site on board before your index becomes valuable. You can get in touch with sites via social media, email, Discord, or even by visiting them face to face.


You really do need every single site, as search is a long tail problem. All the interesting stuff is in the fringes; if you only have a few big sites you'll have a search engine of spam.


I think that is only needed for a small subset of queries. Seriously think of the last time you did a search and went to a fringe site as opposed to a well known brand or social media. Ranking quality is much more important than coverage over the whole internet.


> Seriously think of the last time you did a search and went to a fringe site as opposed to a well known brand or social media.

Oh, almost never. That's exactly why search sucks now.


.io is actually a country-code TLD for a territory that no longer exists.


I was aware of that, there was some controversy regarding that TLD a while back.

I consider it an exotic one though, just like .tv (which is also a country TLD) or .ai (also a country TLD, for Anguilla).


Nobody is coming to help.


"All of the people I know who were friends with this sociopathic child-trafficking pedophile told me he was reformed now" is certainly something to put out there.


They were sealed at the time, and the admin was following the law.


Yep, in certain cases requests/responses will only show up in the Fetch domain, sometimes only in the Network domain, and sometimes neither!


The DMV isn't exactly validating heights, I think we need more evidence.


Your mistake is thinking that the ordinary person is sitting on a pile of cash.

The median person in the US has about $8k, which is basically enough to cover one emergency.


dang is, at best, oblivious to the fact that this site has become a battleground. At worst, he's intentionally chosen sides with his selective removal of flags.


> The land of 70k jobs

What are these 70k jobs in the south?


All over, in the right professions. Accountant, for example, probably basically everywhere. I imagine it wouldn't be hard to pull 70k doing b2b sales. Plenty of other white collar work probably gets you there too. Skilled trades are also very well compensated today and in demand as well.


Raleigh-Durham Metro, Metro Houston, Metro DFW (imo not the South), Charlotte Metro, and Metro Atlanta off the top of my head and based on median household incomes.

That said, assuming you could afford a 2k square ft house with a backyard in a highly desirable neighborhood similar to what Palo Alto is today on an average person's salary 50 years ago doesn't seem realistic.

Also, 50 years ago, redlining and race- as well as gender-based discrimination in most jobs were the norm, so unless you were a white (which itself was a narrower term than today) man, there was a glass ceiling, and most jobs that were supposedly high paying in reality largely limited hiring to a subset of Americans.

Additionally, the rural-urban divide then was more severe than it is today. People from those households like Marc Andreessen literally didn't have piped water growing up back then in the 70s (he's recounted the story a lot).

Long story short, I don't buy a lot of the nostalgia for the 70s and 80s I'm seeing in this thread - it's very boomer urban white man coded.


'Household income' would imply that 70k jobs are not at all the norm.


True, but the median American is also not college educated or working in a skilled manufacturing industry; it does hold at the 60th-75th percentile in most cases.

