> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.
I have always wondered: how does the Wayback Machine work? Is there no way we could take the Wayback archive and somehow build an index on top of everything in it?
1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler.
2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy, because there literally is no other place to go. I don't think they can use Google's syndicated API, because they would then start showing ads in their database, which is garbage that would strain their tiny storage budget.
3. Minor from a software engineering perspective but important for the survival of the company: since they are an archive of record, converting that into an index would need a good legal team to argue it out against Google. They do have the DoJ's recent ruling in their favor.
I do not know a lot about this subject, but couldn’t you make a pretty decent index off of Common Crawl? It seems to me the bar is so low you wouldn’t have to have everything. Especially if your goal was not monetization with ads.
I think someone commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start. I think the key to a good index people will use is freshness of the results. You also need good recall for a search engine; precision tuning and re-ranking are not going to help otherwise.
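To make the recall point concrete, here is a minimal sketch with made-up document IDs (nothing here comes from a real crawl; `precision_recall` is a hypothetical helper). It assumes four pages are relevant to a query but the index only ever crawled one of them. Even a perfect re-ranker then tops out at 25% recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the engine returned
    relevant:  documents that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: four relevant pages exist on the web...
relevant = {"a", "b", "c", "d"}
# ...but the small index only contains one of them, plus two junk pages.
small_index = {"a", "x", "y"}

# Best case: a perfect re-ranker returns only the relevant page it has.
best_possible = [doc for doc in small_index if doc in relevant]

precision, recall = precision_recall(best_possible, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Ranking can only reorder what was crawled, so in this toy setup the three missing pages stay missing no matter how good the ranker is.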
Are these websites not serving public content? If there are legal concerns, just create a separate scraping LLC that fakes its user agent and uses residential IPs or a VPN or something. I can't imagine that the companies would follow through with some sort of lawsuit against a scraper that's trying to index their site to get them more visitors, if they allow GoogleBot.
There is a logistics problem here - even if you had enough money to pay, how would you get in touch with every single site to even let them know you're happy to pay? It's not like site operators routinely scan their error logs to see your failed crawling attempts and your offer in the user-agent.
Even if they see it, it's a classic chicken & egg problem: it's not worth the site operator's time to engage with your offer until your search engine is popular enough to matter, but your search engine will never become popular enough to matter if it doesn't have a critical mass of sites to begin with.
Realistically you don't need every single site on board before your index becomes valuable. You can get in touch with sites via social media, email, Discord, or even by visiting them face to face.
You really do need every single site, as search is a long tail problem. All the interesting stuff is in the fringes; if you only have a few big sites you'll have a search engine of spam.
I think that is only needed for a small subset of queries. Seriously, think of the last time you did a search and went to a fringe site as opposed to a well-known brand or social media. Ranking quality is much more important than coverage of the whole internet.
"All of the people I know who were friends with this sociopathic child-trafficking pedophile told me he was reformed now" is certainly something to put out there.
dang is, at best, oblivious to the fact that this site has become a battleground. At worst, he's intentionally chosen sides with his selective removal of flags.
All over, in the right professions. Accountants, for example, probably basically everywhere. I imagine it wouldn't be hard to pull 70k doing b2b sales. Plenty of other white-collar work probably gets you there too. Skilled trades are also very well compensated today and in demand.
Raleigh-Durham Metro, Metro Houston, Metro DFW (imo not the South), Charlotte Metro, and Metro Atlanta off the top of my head and based on median household incomes.
That said, assuming you could afford a 2k square ft house with a backyard in a highly desirable neighborhood similar to what Palo Alto is today on an average person's salary 50 years ago doesn't seem realistic.
Also, 50 years ago, redlining and race- as well as gender-based discrimination in most jobs were the norm, so unless you were a white (which itself was a narrower term than today) man, there was a glass ceiling, and most jobs that were supposedly high paying in reality largely limited hiring to a subset of Americans.
Additionally, the rural-urban divide then was more severe than it is today. People from those households, like Marc Andreessen, literally didn't have piped water growing up back then in the 70s (he's recounted the story a lot).
Long story short, I don't buy a lot of the nostalgia for the 70s and 80s I'm seeing in this thread - it's very boomer urban white man coded.
True, but the median American is also not college educated or working in a skilled manufacturing industry, though that is true at the 60th-75th percentile in most cases.