TIL that Bell Labs and a whole lot of other websites block archive.org, not to mention most search engines. Turns out I have a broken website link in a GitHub repo, caused by the deletion of an old webpage. When I tried to pull the original from archive.org, I found that it’s not available because Bell Labs blocks the archive.org crawler in their robots.txt:
User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /RealAudio/
Disallow: /bl-traces/
Disallow: /fast-os/
Disallow: /hidden/
Disallow: /historic/
Disallow: /incoming/
Disallow: /inferno/
Disallow: /magic/
Disallow: /netlib.depend/
Disallow: /netlib/
Disallow: /p9trace/
Disallow: /plan9/sources/
Disallow: /sources/
Disallow: /tmp/
Disallow: /tripwire/
Visit-time: 0700-1200
Request-rate: 1/5
Crawl-delay: 5

User-agent: *
Disallow: /
In fact, Bell Labs not only blocks the Internet Archive bot, it blocks all bots except for Googlebot, msnbot, and their own corporate bot. And msnbot was superseded by bingbot five years ago!
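If you want to see how a file like this plays out for a given crawler, here's a rough sketch using Python's standard `urllib.robotparser`. The rules are an abridged copy of the file quoted above. The helper `allowed_bots` is a name I made up for illustration, and I'm assuming `ia_archiver` and `YandexBot` are the tokens the Internet Archive and Yandex crawlers match against. Python's parser is only one interpretation of robots.txt, so a real crawler may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the rules quoted above (only two Disallow lines kept).
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: msnbot
User-agent: LSgsa-crawler
Disallow: /plan9/sources/
Disallow: /tmp/
Crawl-delay: 5

User-agent: *
Disallow: /
"""

# Assumption: ia_archiver and YandexBot are the user-agent tokens used by
# the Internet Archive and Yandex crawlers; adjust to the bots you care about.
BOTS = ["Googlebot", "msnbot", "bingbot", "ia_archiver", "YandexBot"]


def allowed_bots(robots_txt, bots, path="/"):
    """Hypothetical helper: map each bot name to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in bots}


if __name__ == "__main__":
    for bot, ok in allowed_bots(ROBOTS_TXT, BOTS).items():
        print(f"{bot:12} {'allowed' if ok else 'blocked'}")
```

Any bot that isn't named in the first group falls through to the `User-agent: *` group, which disallows everything. That's why bingbot ends up blocked under this parser even though msnbot isn't.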
A quick search using a phrase that’s only found at Bell Labs, e.g., “This is a start at making available some of the material from the Tenth Edition Research Unix manual.”, reveals that Bing indexes the page; either bingbot follows some msnbot rules, or msnbot still runs independently and indexes sites like Bell Labs, which ban bingbot but not msnbot. Luckily, in this case, a lot of search engines (like Yahoo and DDG) use Bing results, so Bell Labs hasn’t disappeared from the non-Google internet, but you’re out of luck if you’re one of the 55% of Russians who use Yandex.
And all that is a relatively good case, where one non-Google crawler is allowed to operate. It’s not uncommon to see robots.txt files that ban everything but Googlebot. Running a competing search engine and preventing a Google monopoly is hard enough without having sites ban non-Google bots. We don’t need to make it even harder, nor do we need to accidentally ban the Internet Archive bot.
P.S. While you’re checking that your robots.txt doesn’t ban everyone but Google, consider looking at your CPUID checks to make sure that you’re using feature flags instead of banning everyone but Intel and AMD.
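To make the CPUID point concrete, here's a small sketch of the difference between gating on the vendor string and gating on feature flags. Real CPUID checks usually live in C or assembly; this version assumes Linux on x86, where the kernel exposes the vendor and flags as text in `/proc/cpuinfo`. The function names and the sample text (a made-up CPU with a non-Intel, non-AMD vendor string) are hypothetical.

```python
# Hypothetical /proc/cpuinfo excerpt for a CPU that is neither Intel nor AMD.
SAMPLE_CPUINFO = """\
processor\t: 0
vendor_id\t: CentaurHauls
flags\t\t: fpu sse sse2 avx avx2
"""


def parse_cpuinfo(text):
    """Return (vendor_id, set of feature flags) for the first CPU listed."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


def can_use_vendor_gated(text, feature):
    """The anti-pattern: refuse any vendor that isn't on an allowlist."""
    vendor, flags = parse_cpuinfo(text)
    return vendor in {"GenuineIntel", "AuthenticAMD"} and feature in flags


def can_use(text, feature):
    """The feature-flag check: only ask whether the CPU reports the feature."""
    _, flags = parse_cpuinfo(text)
    return feature in flags
```

With the sample above, `can_use(SAMPLE_CPUINFO, "avx2")` accepts the CPU while `can_use_vendor_gated(SAMPLE_CPUINFO, "avx2")` rejects it, even though the flag is present. On a real Linux machine you'd pass in the contents of `/proc/cpuinfo` instead of the sample.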