Lessons Learned From Reading Postmortems

I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft. I haven’t done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.

What Are C, C++, and Java Used For?

I sometimes forget how much of the software I use day-to-day, and the infrastructure it runs on, is written in languages that are considered to be quite boring. When I think about it for a minute, the list of software written in C, C++, and Java is really pretty long. Across the transitive closure of things I use, plus the libraries and infrastructure those things depend on, those three languages are ahead by a country mile, with PHP, Ruby, and Python rounding out the top 6. JavaScript should be in there somewhere if I throw in front-end stuff, but it’s so ubiquitous that making a list seems a bit pointless.

Advantages of Monolithic Version Control

Here’s a conversation I keep having:

Someone: Did you hear that Facebook/Google uses a giant monorepo? WTF!
Me: Yeah! It’s really convenient, don’t you think?
Someone: That’s THE MOST RIDICULOUS THING I’ve ever heard. Don’t FB and Google know what a terrible idea it is to put all your code in a single repo?
Me: I think engineers at FB and Google are probably familiar with using smaller repos (doesn’t Junio Hamano work at Google?), and they still prefer a single huge repo for [reasons].
Someone: Oh that does sound pretty nice. I still think it’s weird but I could see why someone would want that.

“[reasons]” is pretty long, so I’m writing this down in order to avoid repeating the same conversation over and over again.

We Used to Build Steel Mills Near Cheap Sources of Power, but Now That’s Where We Build Datacenters

Why are people so concerned with hardware power consumption nowadays? Some common answers are that power is critically important for phones, tablets, and laptops, and that we can put more silicon on a modern chip than we can effectively use. In 2001, Patrick Gelsinger observed that if scaling continued at then-current rates, chips would have the power density of a nuclear reactor by 2005, a rocket nozzle by 2010, and the surface of the sun by 2015. Needless to say, that didn’t happen. Portables and scaling limits are both valid and important reasons, but since they’re widely discussed, I’m going to talk about an underrated one.

People often focus on the portable market because it’s cannibalizing the desktop market, but that’s not the only growth market. Servers are also becoming more important than desktops, and power is really important for servers. To see why, let’s look at some calculations from Hennessy & Patterson about what it costs to run a datacenter.
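To make the power argument concrete, here is a back-of-the-envelope sketch of one server's lifetime electricity bill. It is not Hennessy & Patterson's calculation. The function name `lifetime_power_cost` and every input are hypothetical values chosen only for illustration: a 300 W server, a PUE (power usage effectiveness) of 1.5 to fold in cooling and power-distribution overhead, $0.10/kWh, and a four-year lifetime.

```python
# Back-of-the-envelope lifetime electricity cost for one server.
# All example inputs below are hypothetical, not figures from Hennessy & Patterson.

HOURS_PER_YEAR = 365 * 24


def lifetime_power_cost(watts: float, pue: float, dollars_per_kwh: float, years: float) -> float:
    """Return the electricity cost (in dollars) of running one server.

    watts: average power drawn by the server itself.
    pue: power usage effectiveness, i.e. total facility power divided by
         IT power, so overhead such as cooling is included in the total.
    """
    facility_watts = watts * pue
    kwh = facility_watts * years * HOURS_PER_YEAR / 1000
    return kwh * dollars_per_kwh


if __name__ == "__main__":
    cost = lifetime_power_cost(watts=300, pue=1.5, dollars_per_kwh=0.10, years=4)
    print(f"${cost:,.2f}")  # prints "$1,576.80"
```

With these assumed inputs the facility draws about 15,768 kWh on the server's behalf over four years. Substituting real hardware and utility figures is what makes a comparison with the server's purchase price meaningful.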

Dunning-Kruger and Other Memes

It’s really common to see claims that some meme is backed by “studies” or “science”. But when I look at the actual studies, it usually turns out that the data are opposed to the claim. Here are the last few instances of this that I’ve run across.

Software Testing From the Perspective of a Hardware Engineer

I’ve been reading a lot about software testing lately. Coming from a hardware background (CPUs and hardware accelerators), I find it interesting how different software testing is. Bugs in software are much easier to fix, so it makes sense to spend a lot less effort on testing. Because less effort is spent on testing, methodologies differ: software testing is biased away from methods with high fixed costs and towards methods with high variable costs. But that doesn’t explain all of the differences, or even most of them. Most of the differences come from cultural path dependence, which shows how suboptimally test effort is allocated in both hardware and software.
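The fixed-versus-variable tradeoff can be sketched with a simple linear cost model. This model is an illustrative assumption, not a formula from the post. Let method $i$ cost $F_i$ up front plus $v_i$ per unit of testing effort, with $n$ units needed. A high-fixed-cost method (1) beats a high-variable-cost method (2), where $F_1 > F_2$ and $v_1 < v_2$, only past a crossover point:

$$C_i(n) = F_i + v_i\,n, \qquad C_1(n) < C_2(n) \iff n > n^{*} = \frac{F_1 - F_2}{v_2 - v_1}.$$

When cheap bug fixes push the required $n$ down, $n$ tends to stay below $n^{*}$, which favors the high-variable-cost method.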