I can’t think of a single large software company that doesn’t regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure, maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?
A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google’s $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn’t a trivial problem. But in the comments on Alex’s post, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google’s capabilities in the next few years.
What would Lucene at Google’s size look like? If we do a naive back of the envelope calculation on what it would take to index a significant fraction of the internet (often estimated to be 1 trillion (T) or 10T documents), we might expect a 1T document index to cost something like $10B[1]. That’s not a feasible startup, so let’s say that instead of trying to index 1T documents, we want to maintain an artisanal search index of 1B documents. Then our cost comes down to $12M/yr. That’s not so bad – plenty of startups burn through more than that every year. While we’re in the VC-funded hypergrowth mode, that’s fine, but once we have a real business, we’ll want to consider trying to save money. At $12M/yr for the index, a 3% performance improvement that lets us trim our costs by 3% is worth $360k/yr. With those kinds of costs, it’s surely worth it to have at least one engineer working full-time on optimization, if not more.
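The arithmetic above can be sketched in a few lines. These are the post’s own rough estimates, not measured numbers; the ~$12B/yr full-index figure is the one derived in the footnote:

```python
# Back-of-the-envelope sketch of the index-cost estimates above.
# All inputs are rough estimates from the post, not measured numbers.
full_index_cost = 12e9                           # ~$12B/yr for a ~1T-document index
artisanal_cost = full_index_cost * (1e9 / 1e12)  # scale linearly down to 1B docs
print(f"${artisanal_cost / 1e6:.0f}M/yr")        # ≈ $12M/yr

savings = artisanal_cost * 0.03                  # a 3% cost reduction on the index
print(f"${savings / 1e3:.0f}k/yr")               # ≈ $360k/yr
```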
Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There’s also a wide body of research that’s found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies and businesses. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.
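To make the “hire until marginal gain equals marginal cost” logic concrete, here’s a toy model. The engineer cost and the diminishing-returns curve are invented purely to show the shape of the calculation, not real figures:

```python
# Hypothetical illustration of "add optimization engineers until the
# marginal gain no longer covers the marginal cost". All numbers are
# made up for illustration.
ENGINEER_COST = 350_000        # assumed fully loaded cost per engineer per year
infra_bill = 12_000_000        # the $12M/yr index bill from above

engineers = 0
marginal_gain = infra_bill * 0.05          # assume the first hire recovers ~5% of the bill
while marginal_gain >= ENGINEER_COST:      # each hire pays for itself...
    engineers += 1
    marginal_gain *= 0.8                   # ...with diminishing returns per hire
print(engineers)                           # → 3 under these made-up numbers
```

The exact answer is an artifact of the made-up curve; the point is that the break-even head count is a calculation, not a gut feeling, and it grows with the size of the bill.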
And that’s just performance. Features also matter: when I talk to engineers working on basically any product at any company, they’ll often find that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.
Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l’foo” should often match “foo” and vice versa, but American search engines from the 90s didn’t actually handle elision correctly. How about tokenizing Chinese queries, where words don’t have spaces between them, and sentences don’t have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that’s not even the most complicated part of handling Arabic! It’s fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these problems are handled reasonably well by open source projects, but others are still open research problems.
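To give a sense of scale: even the single smallest item on that list, French elision, takes real code. Here’s a toy sketch that strips elided articles so a query for “l’avion” matches documents containing “avion”; real analyzers handle vastly more than this one regex, and this is a gross simplification, not how any production engine does it:

```python
import re

# Toy sketch of one sliver of the i18n problem: stripping French elided
# articles (l', d', qu', ...) so "l'avion" and "avion" match each other.
# Handles both straight and curly apostrophes.
ELISION = re.compile(r"^(l|d|j|m|t|s|c|n|qu)['\u2019]", re.IGNORECASE)

def normalize_fr(token: str) -> str:
    return ELISION.sub("", token.lower())

assert normalize_fr("l'avion") == normalize_fr("avion") == "avion"
assert normalize_fr("L\u2019h\u00f4tel") == "h\u00f4tel"   # curly apostrophe too
```

And this handles exactly one phenomenon in exactly one language; multiply by every quirk of every language in the list above.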
There’s also security! If you don’t “bloat” your company by hiring security people, you’ll end up like hotmail or yahoo, where your product is better known for how often it’s hacked than for any of its other features.
Everything we’ve looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they’re saying to what my coworkers think they’re saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years[2]. When people imagine how long it should take to build something, they’re often imagining a team that works perfectly and spends 100% of its time coding. But that’s impossible to scale up. The question isn’t whether there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a greater innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.
This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google’s capabilities any day now, but there’s nothing about this post that’s unique to search. If you talk to people in almost any field, you’ll hear stories about how people wildly underestimate the complexity of the problems in that field. The point here isn’t that it would be impossible for a small team to build something better than Google search. It’s entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hypergrowth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google’s, unless it wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don’t even realize are there, each of which provides a noticeable increase in revenue. It’s “obvious” that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single or double digit percentages that they can’t get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there’s sales, which most engineers don’t even think of[3], not to mention research (which, almost by definition, involves a lot of bets that don’t pan out).
It’s not that all of those things are necessary to run a service at all; it’s that almost every large service is leaving money on the table if it doesn’t seriously address them. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services – all of this stuff that people don’t think of as “real” work is more work than the core service[4].
I’m experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!
Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, and Ivar Refsdal for corrections.
[1] In public benchmarks, Lucene appears to get something like 30–40 QPS when indexing Wikipedia on a single machine. See AnandTech, Haque et al. (ASPLOS 2015), etc. I’ve seen claims that Lucene can run 10x faster than that on Wikipedia, but I haven’t seen a reproducible benchmark setup showing that, so let’s say that we can expect to get something like 30–300 QPS if we index a Wikipedia-sized corpus on one machine.
Those benchmarks appear to be indexing English Wikipedia, articles only. That’s roughly 50 GB and approximately 5m documents. Estimates of the size of the internet vary, but public estimates often fall into the range of 1 trillion (T) to 10T documents. Say we want to index 1T documents, and we can put 5m documents per machine: we need
1T/5m = 200k machines to handle all of the extra documents. None of the off-the-shelf sharding/distribution solutions that are commonly used with Lucene can scale to 200k machines, but let’s posit that we can solve that problem and can operate a search cluster with 200k machines. We’ll also need some replication so that queries don’t return bad results if a single machine goes down. If we replicate every machine once, that’s 400k machines. But that’s 400k machines for just one cluster. If we only have one cluster sitting in some location, users in other geographic regions will experience bad latency to our service, so maybe we want to have ten such clusters. If we have ten such clusters, that’s 4M machines.
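The cluster-sizing arithmetic, spelled out (all inputs are the rough estimates above):

```python
# Cluster sizing from the estimates above: ~1T documents at ~5M docs per
# machine (a Wikipedia-sized shard), one replica per machine, ten regions.
docs_total = 1_000_000_000_000     # ~1T documents
docs_per_machine = 5_000_000       # ~5M docs per machine

machines = docs_total // docs_per_machine
replicated = machines * 2          # one replica of every machine
clusters = 10                      # one cluster per geographic region
total = replicated * clusters
print(machines, total)             # → 200000 4000000
```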
In the AnandTech Wikipedia benchmark, they get 30 QPS out of a single-socket Broadwell Xeon D with 64 GB of RAM (enough to fit the index in memory). If we don’t want to employ the army of people necessary to build out and run 4M machines’ worth of datacenters, AFAICT the cheapest VM that’s plausibly at least as “good” as that machine is the GCE n1-highmem-8, which goes for $0.352/hr. If we multiply that out by 4M machines, that’s a little over $1.4M an hour, or a little more than $12B a year, for a service that can’t even get within an order of magnitude of the query rate or latency necessary to run a service like Google or Bing. And that’s just for the index – even a minimal search engine also requires crawling. BTW, people will often claim that this is easy because they have much larger indices in Lucene, but with a posting-list based algorithm like Lucene’s, you can very roughly think of query rate as inversely related to the number of postings. When you ask these people with their giant indices what their query rate is, you’ll inevitably find that it’s glacial by internet standards. For reference, the core of Twitter was a Rails app that could handle something like 200 QPS until 2008. If you look at what most people handle with Lucene, it’s often well under 1 QPS, with documents that are much smaller than the average web document, using configurations that damage search relevance too much to be used in commercial search engines (e.g., using stop words). That’s fine, but the fact that people think that sort of experience is somehow relevant to web search is indicative of the problem this post is discussing.
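The dollar math for that fleet, using the $0.352/hr VM price quoted above:

```python
# Dollar math for a 4M-VM fleet at the post's quoted n1-highmem-8 price.
machines = 4_000_000
vm_price_per_hour = 0.352          # $/hr per VM (price quoted in the post)

hourly = machines * vm_price_per_hour
yearly = hourly * 24 * 365
print(f"${hourly / 1e6:.2f}M/hr, ${yearly / 1e9:.1f}B/yr")   # → $1.41M/hr, $12.3B/yr
```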
That also assumes that we won’t hit any other scaling problem if we can make 400k VM clusters. But finding an open source index which will scale not only to the number of documents on the internet, but also the number of terms, is non-trivial. Before you read the next section, try guessing how many unique terms there are online. And then if we shard the internet so that we have 5m documents per machine, try guessing how many unique terms you expect to see per shard.
When I ask this question, I often hear guesses like “ten million” or “ten billion”. But without even looking at the entire internet, just looking at one single document on GitHub, we can find a document with fifty million unique terms.
So there are definitely more than ten million unique terms on the entire internet! In fact, there’s a website out there that has all primes under one trillion. I believe there are something like thirty-seven billion of those. If that website falls into one shard of our index, we’d expect to see more than thirty-seven billion terms in a single shard; that’s more than most people guess we’ll see on the entire internet, and that’s just in one shard that happens to contain one somewhat pathological site. If we try to put the internet into any existing open source index that I know of, not only will it not be able to scale out enough horizontally, many shards will contain data weird enough to make the entire shard fall over if we run a query. That’s nothing against open source software; like any software, it’s designed to satisfy the needs of its users, and none of its users do anything like index the entire internet. As businesses scale up, they run into funny corner cases that people without exposure to the particular domain don’t anticipate.
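As a sanity check on the “thirty-seven billion primes” figure, the prime number theorem’s π(n) ≈ n / ln(n) approximation (a known slight underestimate) lands in the same ballpark:

```python
import math

# Sanity check on "thirty-seven billion primes below a trillion" using the
# prime number theorem: pi(n) is approximately n / ln(n), which slightly
# underestimates the true count.
n = 1e12
approx = n / math.log(n)
print(f"approx {approx / 1e9:.1f} billion primes below 10^12")   # ≈ 36.2 billion
```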
People often object that you don’t need to index all of this weird stuff. There have been efforts to build web search engines that only index the “important” stuff, but it turns out that if you ask people to evaluate search engines, some people will type in the weirdest queries they can think of and base their evaluation off of that. And others type in what they think of as normal queries for their day-to-day work even if they seem weird to you (e.g., a biologist might query for
GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA). If you want to be anything but a tiny niche player, you have to handle not only the weirdest stuff you can think of, but the weirdest stuff that many people can think of.