Why Use ECC?

Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

  1. Google didn’t use ECC when they built their servers in 1999
  2. Most RAM errors are hard errors and not soft errors
  3. RAM errors are rare because hardware has improved
  4. If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is downright enterprisey
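
For context on what ECC actually buys you, it adds enough redundancy to a memory word that a single flipped bit (the classic soft error) can be located and corrected in hardware. The sketch below is my own illustration, not anything from Atwood's post: a toy Hamming(7,4) code in C that protects 4 data bits with 3 parity bits. Real ECC DIMMs use a wider SECDED code over 64-bit words, but the mechanism is the same.

```c
/* Minimal sketch of the idea behind ECC memory: a Hamming(7,4) code encodes
 * 4 data bits with 3 parity bits so any single flipped bit can be corrected.
 * Toy example only; real ECC memory uses a wider SECDED code. */
#include <stdio.h>

/* Encode 4 data bits into a 7-bit codeword.
 * Bit positions 1..7; positions 1, 2, 4 hold parity, the rest hold data. */
static unsigned encode(unsigned data)
{
    unsigned d0 = (data >> 0) & 1;
    unsigned d1 = (data >> 1) & 1;
    unsigned d2 = (data >> 2) & 1;
    unsigned d3 = (data >> 3) & 1;

    unsigned p1 = d0 ^ d1 ^ d3;  /* covers positions 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;  /* covers positions 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;  /* covers positions 5, 6, 7 */

    /* Codeword layout (bit index = position - 1):
     * pos: 1  2  3  4  5  6  7
     * val: p1 p2 d0 p4 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* XOR the positions of all set bits; a valid codeword gives 0, and a single
 * flipped bit gives the position of that bit. */
static unsigned syndrome(unsigned cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1)
            s ^= pos;
    return s;
}

int main(void)
{
    unsigned data = 0xB;                 /* 1011: arbitrary 4-bit payload */
    unsigned cw = encode(data);
    unsigned corrupted = cw ^ (1u << 5); /* simulate a single-bit soft error */

    unsigned s = syndrome(corrupted);
    unsigned fixed = s ? corrupted ^ (1u << (s - 1)) : corrupted;

    printf("codeword  %02x\ncorrupted %02x\nsyndrome  %u\nfixed     %02x\n",
           cw, corrupted, s, fixed);
    return 0;
}
```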

What’s Worked in Computer Science

In 1999, Butler Lampson gave a talk about the past and future of “computer systems research”. Here are his opinions from 1999 on “what worked”.

Yes                | Maybe              | No
Virtual memory     | Parallelism        | Capabilities
Address spaces     | RISC               | Fancy type systems
Packet nets        | Garbage collection | Functional programming
Objects / subtypes | Reuse              | Formal methods
RDB and SQL        |                    | Software engineering
Transactions       |                    | RPC
Bitmaps and GUIs   |                    | Distributed computing
Web                |                    | Security

Disaggregated Disk

Hardware performance “obviously” affects software performance and affects how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that blocked array accesses give better performance than repeatedly striding through an array.
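
To make the blocking point concrete, here's a sketch (my example, not from the post) of a matrix transpose written two ways: the naive version strides through the source array and takes a cache miss on nearly every read, while the tiled version works on small blocks so each cache line gets reused before it's evicted. N and BLOCK are illustrative values.

```c
/* Sketch of cache blocking (loop tiling) for a matrix transpose.
 * N and BLOCK are illustrative; tune BLOCK so a tile fits in cache. */
#include <stdlib.h>

#define N     2048
#define BLOCK 64

/* Naive transpose: dst is written row-by-row, but src is read with a stride
 * of N elements, so nearly every read touches a new cache line. */
void transpose_strided(double *dst, const double *src)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            dst[i * N + j] = src[j * N + i];
}

/* Blocked transpose: work on BLOCK x BLOCK tiles so the cache lines brought
 * in for one tile are reused before they are evicted. */
void transpose_blocked(double *dst, const double *src)
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK; i++)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    dst[i * N + j] = src[j * N + i];
}

int main(void)
{
    double *src = malloc(sizeof(double) * N * N);
    double *dst = malloc(sizeof(double) * N * N);
    if (!src || !dst)
        return 1;
    for (size_t k = 0; k < (size_t)N * N; k++)
        src[k] = (double)k;

    transpose_strided(dst, src);  /* time these two with your favorite tool */
    transpose_blocked(dst, src);

    free(src);
    free(dst);
    return 0;
}
```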

Something that’s occasionally overlooked is that hardware performance also has profound implications for system design and architecture. Let’s look at this table of latencies that’s been passed around since 2012:

Distributed Systems: When Limping Hardware Is Worse Than Dead Hardware

Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail’s pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it’s not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.
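
To see why “limping” is so much worse than “dead”, it's worth doing the arithmetic. The sketch below uses made-up but plausible numbers: if each of 100 nodes returns a 1 MB shard and the request can't complete until the slowest node answers, a NIC degraded from 1 Gbps to 1 Kbps turns an 8 ms transfer into roughly a two-hour one, whereas a dead node would get noticed and routed around.

```c
/* Back-of-the-envelope sketch of why a limping NIC is worse than a dead one.
 * All numbers are illustrative, not from any particular postmortem. */
#include <stdio.h>

static double transfer_seconds(double bytes, double bits_per_second)
{
    return bytes * 8.0 / bits_per_second;
}

int main(void)
{
    const double shard_bytes = 1e6;   /* 1 MB per node */
    const double healthy_bps = 1e9;   /* 1 Gbps NIC */
    const double limping_bps = 1e3;   /* same NIC degraded to 1 Kbps */

    double healthy = transfer_seconds(shard_bytes, healthy_bps);
    double limping = transfer_seconds(shard_bytes, limping_bps);

    printf("healthy node: %.3f s per shard\n", healthy);
    printf("limping node: %.0f s per shard (~%.1f hours)\n",
           limping, limping / 3600.0);

    /* A dead node gets detected and routed around; the limping node still
     * "works", so a 100-node fan-in request waits on its slowest shard. */
    printf("fan-in request gated by slowest shard: %.0f s\n", limping);
    return 0;
}
```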

The situation reminds me of how distributed systems failures were discussed before Jepsen. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don’t talk about them publicly, and the general public has no easy way of figuring out if the systems they’re running are sound.

Lessons Learned From Reading Postmortems

I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven’t done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.

The Dominance of Boring Languages for Large Scale Systems (and Some Other Areas)

If you spend a lot of time on Twitter, HN, and Reddit, it’s easy to forget how much software is written in languages that are considered to be quite boring. When I think for a minute, the list of software written in C, C++, and Java is really pretty long. Among the transitive closure of things I use and the libraries and infrastructure used by those things, those three languages are ahead by a country mile, with PHP, Ruby, and Python rounding out the top 6. JavaScript should be in there somewhere if I throw in front-end stuff, but it’s so ubiquitous that making a list seems a bit pointless.