The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future

Perf is probably the most widely used general-purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they’re sampling profilers. Sampling profilers are great. They tend to be easy to use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can’t debug effectively, and those problems are becoming more important.

For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.

We saw some really bad Intel CPU bugs in 2015, and we should expect to see more in the future

2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they’ve been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices.

Things haven’t looked so great on the engineering/bugs side of things, though. I don’t keep track of Intel bugs unless they’re so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulich, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.

Normalization of deviance in software: how broken practices become standard

Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time when I describe something everyone at work thinks is normal. For some reason, my conversation partner’s face morphs from pleasant smile to rictus of horror. Here are a few representative examples.

There’s the company that is perhaps the nicest place I’ve ever worked, combining the best parts of Valve and Netflix. The people are amazing and you’re given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right?

There’s the company that’s incredibly secretive about infrastructure. For example, there’s the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn’t reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn’t make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. The result of those policies is that I know multiple people who are afraid to forward emails about things like insurance updates for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone. Normal.

Big company vs. startup work and pay

There’s a meme that’s been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. Paul Graham says that the best way to make money is to “start or join a startup”, which has been “a reliable way to get rich for hundreds of years”, and that you can “compress a career’s worth of earnings into a few years”. Michael Arrington says that you’ll become a part of history. Joel Spolsky says that by joining a big company, you’ll end up playing foosball and begging people to look at your code. Sam Altman says that if you join Microsoft, you won’t build interesting things and may not work with smart people. They all claim that you’ll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.

Files are hard

I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and Outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than Gmail, even though my Gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can’t, but none of the file corruption issues I’ve had have been from total disk failure. Why has my experience with desktop applications been so bad?

Should I buy ECC memory?

Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

  1. Google didn’t use ECC when they built their servers in 1999
  2. Most RAM errors are hard errors and not soft errors
  3. RAM errors are rare because hardware has improved
  4. If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is downright enterprisey

What’s worked in computer science

In 1999, Butler Lampson gave a talk about the past and future of “computer systems research”. Here are his opinions from 1999 on “what worked”.

Yes                | Maybe              | No
Virtual memory     | Parallelism        | Capabilities
Address spaces     | RISC               | Fancy type systems
Packet nets        | Garbage collection | Functional programming
Objects / subtypes | Reuse              | Formal methods
RDB and SQL        |                    | Software engineering
Transactions       |                    | RPC
Bitmaps and GUIs   |                    | Distributed computing
Web                |                    | Security

Disaggregated disk

Hardware performance “obviously” affects software performance and affects how software is optimized. For example, the fact that caches are multiple orders of magnitude faster than RAM means that blocked array accesses give better performance than repeatedly striding through an array.
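To make the blocked-versus-strided distinction concrete, here's a minimal sketch that computes column sums of a row-major matrix two ways. The function names, the flat-list layout, and the block size of 64 are all assumptions made for illustration; none of them come from the post. Note that this only shows the access order: in CPython, interpreter overhead and boxed integers hide most of the cache effect, so the speedup you'd expect from the blocked version really shows up in something like C or in a contiguous typed array.

```python
def strided_column_sums(matrix, n_rows, n_cols):
    """Column sums over a flat row-major list, one full column at a time.

    Consecutive reads are n_cols elements apart, so on a large matrix
    each read tends to touch a different cache line.
    """
    sums = [0] * n_cols
    for c in range(n_cols):
        for r in range(n_rows):
            sums[c] += matrix[r * n_cols + c]
    return sums


def blocked_column_sums(matrix, n_rows, n_cols, block=64):
    """Same result, but the matrix is visited in block x block tiles.

    Inside a tile, reads along a row are contiguous, and the tile is
    small enough (by assumption) to stay in cache while it's in use.
    """
    sums = [0] * n_cols
    for r0 in range(0, n_rows, block):
        for c0 in range(0, n_cols, block):
            for r in range(r0, min(r0 + block, n_rows)):
                base = r * n_cols
                for c in range(c0, min(c0 + block, n_cols)):
                    sums[c] += matrix[base + c]
    return sums
```

Both functions do the same arithmetic and return the same sums; the only difference is the order in which memory is touched, which is exactly the thing the hardware cares about.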

Something that’s occasionally overlooked is that hardware performance also has profound implications for system design and architecture. Let’s look at this table of latencies that’s been passed around since 2012:

Distributed systems: when limping hardware is worse than dead hardware

Every once in a while, you hear a story like “there was a case of a 1-Gbps NIC card on a machine that suddenly was transmitting only at 1 Kbps, which then caused a chain reaction upstream in such a way that the performance of the entire workload of a 100-node cluster was crawling at a snail’s pace, effectively making the system unavailable for all practical purposes”. The stories are interesting and the postmortems are fun to read, but it’s not really clear how vulnerable systems are to this kind of failure or how prevalent these failures are.

The situation reminds me of distributed systems failures before Jepsen. There are lots of anecdotal horror stories, but a common response to those is “works for me”, even when talking about systems that are now known to be fantastically broken. A handful of companies that are really serious about correctness have good tests and metrics, but they mostly don’t talk about them publicly, and the general public has no easy way of figuring out if the systems they’re running are sound.