- Google didn’t use ECC when they built their servers in 1999
- Most RAM errors are hard errors and not soft errors
- RAM errors are rare because hardware has improved
- If ECC were actually important, it would be used everywhere and not just in servers. Paying for optional stuff like this is downright enterprisey
In 1999, Butler Lampson gave a talk about the past and future of “computer systems research”. Here are his opinions from 1999 on “what worked”.
| Yes | Maybe | No |
|---|---|---|
| Address spaces | RISC | Fancy type systems |
| Packet nets | Garbage collection | Functional programming |
| Objects / subtypes | Reuse | Formal methods |
| RDB and SQL | | Software engineering |
| Bitmaps and GUIs | | Distributed computing |
Hardware performance “obviously” affects software performance and how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that cache-friendly blocked array accesses give better performance than repeatedly striding through an array.
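The gap is easy to see in a toy benchmark. Here's a minimal sketch (my own illustration, not from the original post; the array size and timing method are arbitrary choices) that sums the same 2D array twice: once walking memory in layout order, so every fetched cache line is fully used, and once striding down columns, so each cache line fetch is mostly wasted.

```c
#include <stdio.h>
#include <time.h>

#define N 4096

static double a[N][N];

int main(void) {
    /* Fill the array so the sums below have a known value. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    /* Row-major (sequential) traversal: memory is touched in the order
     * it's laid out, so every byte of each cache line fetched from RAM
     * gets used before the line is evicted. */
    clock_t t0 = clock();
    double row_sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            row_sum += a[i][j];
    clock_t t1 = clock();

    /* Column-major (strided) traversal: each access jumps N * 8 bytes,
     * so a whole cache line is fetched for a single element and is
     * usually evicted before its neighbors are needed. Same arithmetic,
     * far more RAM traffic. */
    double col_sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            col_sum += a[i][j];
    clock_t t2 = clock();

    printf("sequential: %.3fs  strided: %.3fs  (sums: %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC,
           row_sum, col_sum);
    return 0;
}
```

On a typical machine the strided pass is several times slower even though it does exactly the same work; compile with something like `gcc -O2 stride.c` and run it to see the difference.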
Something that’s occasionally overlooked is that hardware performance also has profound implications for system design and architecture. Let’s look at this table of latencies that’s been passed around since 2012:
Typical server utilization is between 10% and 50%. Google has demonstrated 90% utilization without impacting latency SLAs. Xkcd estimated that Google owns 2 million machines. If you estimate an amortized total cost of $4k per machine per year, that’s $8 billion per year. With numbers like that, even small improvements have a large impact, and this isn’t a small improvement.
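To put rough numbers on that (my own back-of-the-envelope arithmetic, not a figure from the post or from Google): the $8 billion is just 2 million machines × $4k/machine/year, and if utilization could be doubled from, say, 45% to 90%, the same work would need roughly half as many machines, which at the $4k figure is on the order of $4 billion per year.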
Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail’s pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it’s not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.
The situation reminds me of distributed systems failures before Jepsen. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don’t talk about them publicly, and the general public has no easy way of figuring out if the systems they’re running are sound.
I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven’t done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.
TIL that Bell Labs, along with a whole lot of other websites, blocks archive.org’s crawler (not to mention most search engine crawlers). It turns out I have a broken link in a github repo, caused by the deletion of an old webpage. When I tried to pull the original from archive.org, I found that it isn’t available because Bell Labs blocks the archive.org crawler in its robots.txt: