Measurement, benchmarking, and data analysis are underrated

A question I get asked with some frequency is: why bother measuring X, why not build something instead? More bluntly, in a recent conversation with a newsletter author, his comment on some future measurement projects I wanted to do (in the same vein as other projects like keyboard vs. mouse, keyboard, terminal, and end-to-end latency measurements) was, "so you just want to get to the top of Hacker News?" The implication of the former is that measuring is less valuable than building, and of the latter that measuring isn't valuable at all (perhaps other than for fame), but I don't see measuring as lesser, let alone worthless. If anything, because measurement is, like writing, not generally valued, it's much easier to find high ROI measurement projects than high ROI building projects.

Let's start by looking at a few examples of high impact measurement projects. My go-to example for this is Kyle Kingsbury's work with Jepsen. Before Jepsen, a handful of huge companies (the now $1T+ companies that people are calling "hyperscalers") had decently tested distributed systems. They mostly didn't talk about testing methods in a way that really caused the knowledge to spread to the broader industry. Outside of those companies, most distributed systems were, by my standards, not particularly well tested.

At the time, a common pattern in online discussions of distributed correctness was:

Person A: Database X corrupted my data.
Person B: It works for me. It's never corrupted my data.
A: How do you know? Do you ever check for data corruption?
B: What do you mean? I'd know if we had data corruption (alternate answer: sure, we sometimes have data corruption, but it's probably a hardware problem and therefore not our fault).

Kyle's early work found critical flaws in nearly everything he tested, despite Jepsen being much less sophisticated then than it is now.

Many of these problems had existed for quite a while. For example, on Elasticsearch:

What’s really surprising about this problem is that it’s gone unaddressed for so long. The original issue was reported in July 2012; almost two full years ago. There’s no discussion on the website, nothing in the documentation, and users going through Elasticsearch training have told me these problems weren’t mentioned in their classes.

Kyle then quotes a number of users who ran into issues in production and then dryly notes:

Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present

Although we don't have an A/B test of universes where Kyle exists vs. not and can't say how long it would've taken for distributed systems to get serious about correctness in a universe where Kyle didn't exist, from having spent many years looking at how developers treat correctness bugs, I would bet on distributed systems having rampant correctness problems until someone like Kyle came along. The typical response that I've seen when a catastrophic bug is reported is that the project maintainers will assume that the bug report is incorrect (and you can see many examples of this if you look at responses from the first few years of Kyle's work). When the reporter doesn't have a repro for the bug, which is quite common when it comes to distributed systems, the bug will be written off as non-existent.

When the reporter does have a repro, the next line of defense is to argue that the behavior is fine (you can also see many examples of this in the responses to Kyle's work). Once the bug is acknowledged as real, the next defense is to argue that the bug doesn't need to be fixed because it's so uncommon (e.g., "It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?"). And then, after it's acknowledged that the bug should be fixed, the final line of defense is to argue that the project takes correctness very seriously and there's really nothing more that could have been done; development and test methodology doesn't need to change because it was just a fluke that the bug occurred, and analogous bugs won't occur in the future even without changes in methodology.

Kyle's work blew through these defenses and, without something like it, my opinion is that these excuses would still be the main defense against reports of distributed systems bugs (as opposed to the adoption of test methodologies that can actually produce pretty reliable systems).
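To make the contrast concrete, here's a minimal sketch of the black-box style of test Jepsen popularized: drive the system with many concurrent clients, record which operations the system acknowledged, and then check whether the final state is consistent with those acknowledgments. This is an illustration, not Jepsen's actual API; the InMemoryStore and the function names are hypothetical stand-ins, and a real test would point the workers at the database under test while injecting faults (partitions, node crashes) mid-run.

```python
# Sketch of an acknowledged-writes check in the style of early Jepsen tests
# (illustrative only, not Jepsen's API). Workers concurrently add unique
# values to a set, we record which adds the store acknowledged, then verify
# that no acknowledged write was silently lost.
import threading


class InMemoryStore:
    """Stand-in for the system under test; swap in a real database client."""

    def __init__(self):
        self._data = set()
        self._lock = threading.Lock()

    def add_to_set(self, value: int) -> bool:
        with self._lock:
            self._data.add(value)
        return True  # True means "the store acknowledged the write"

    def read_set(self) -> set:
        with self._lock:
            return set(self._data)


def run_test(store, n_workers=5, writes_per_worker=1000):
    acked = set()  # values the store claimed to have written
    acked_lock = threading.Lock()

    def worker(worker_id):
        for i in range(writes_per_worker):
            value = worker_id * writes_per_worker + i  # unique per write
            if store.add_to_set(value):
                with acked_lock:
                    acked.add(value)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Any acknowledged write missing from the final read is a correctness bug.
    lost = acked - store.read_set()
    print(f"acknowledged={len(acked)} lost={len(lost)}")
    return lost


if __name__ == "__main__":
    assert run_test(InMemoryStore()) == set()
```

The in-memory stand-in trivially passes; against a real system, the interesting runs are the ones where faults are injected while writes are in flight, which is exactly where "it works for me" systems tend to lose acknowledged data.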

That's one particular example, but I find that it's generally true that, in areas where no one is publishing measurements/benchmarks of products, the products are generally sub-optimal, often in ways that are relatively straightforward to fix once measured. Here are a few examples:

This post has made some justifications for why measuring things is valuable but, to be honest, the impetus for my measurements is curiosity. I just want to know the answer to a question; most of the time, I don't write up my results. But even if you have no curiosity about what's actually happening when you interact with the world and you're "just" looking for something useful to do, the lack of measurements of almost everything means that it's easy to find high ROI measurement projects (at least in terms of impact on the world; if you want to make money, building something is probably easier to monetize).

Appendix: the motivation for my measurement posts

There's a sense in which it doesn't really matter why I decided to write these posts, but if I were reading someone else's post on this topic, I'd still be curious what got them writing, so here's what prompted me to write my measurement posts (which, for the purposes of this list, include posts where I collate data and don't do any direct measurement).

BTW, writing up this list made me realize that a narrative I had in my head about how and when I started really looking at data seriously must be wrong. I thought that this was something that came out of my current job, but that clearly cannot be the case since a decent fraction of my posts from before my current job are about looking at data and/or measuring things (and I didn't even list some of the data-driven posts where I just read some papers and look at what data they present). Blogger, measure thyself.

Appendix: why you can't trust some reviews

One thing that both increases and decreases the impact of doing good measurements is that most measurements that are published aren't very good. This increases the personal value of understanding how to do good measurements and of doing good measurements, but it blunts the impact on other people, since people generally don't understand what makes measurements invalid and don't have a good algorithm for deciding which measurements to trust.

There are a variety of reasons that published measurements/reviews are often problematic. A major issue with reviews is that, in some industries, reviewers are highly dependent on manufacturers for review copies.

Car reviews are one of the most extreme examples of this. Consumer Reports is the only major reviewer that independently sources their cars, which often causes them to disagree with other reviewers: they'll try to buy the trim level of the car that most people buy, which is often quite different from the trim level reviewers are given by manufacturers, and they generally manage to avoid reviewing cars that are unrepresentatively picked or tuned. There have been a couple of cases where Consumer Reports reviewers (who also buy the cars) have said that they suspected someone realized they worked for Consumer Reports because the dealer said they needed to keep the car overnight before handing over the car they'd just bought; when that's happened, the reviewer has walked away from the purchase.

There's pretty significant copy-to-copy variation between cars, and the cars reviewers get tend to be ones that were picked to avoid cosmetic issues (paint problems, panel gaps, etc.) and checked for more serious issues. Additionally, cars can have their software and firmware tweaked (e.g., it's common knowledge that review copies of BMWs have an engine "tune" that would void your warranty if you modified your car similarly).

Also, because Consumer Reports isn't getting review copies from manufacturers, they don't have to pull their punches and can write reviews that are highly negative, something you rarely see from car magazines and don't often see from car youtubers. With those outlets, you generally have to read between the lines to get an honest review, since a review that explicitly mentions negative things about a car can mean losing access (the youtuber who goes by "savagegeese" has mentioned having trouble getting access to cars from some companies after giving honest reviews).

Camera lenses are another area where it's been documented that reviewers get unusually good copies of the item. There's tremendous copy-to-copy variation between lenses so vendors pick out good copies and let reviewers borrow those. In many cases (e.g., any of the FE mount ZA Zeiss lenses or the Zeiss lens on the RX-1), based on how many copies of a lens people need to try and return to get a good copy, it appears that the median copy of the lens has noticeable manufacturing defects and that, in expectation, perhaps one in ten lenses has no obvious defect (this could also occur if only a few copies were bad and those were serially returned, but very few photographers really check to see if their lens has issues due to manufacturing variation). Because it's so expensive to obtain a large number of lenses, the amount of copy-to-copy variation was unquantified until lensrentals started measuring it; they've found that different manufacturers can have very different levels of copy-to-copy variation, which I hope will apply pressure to lens makers that are currently selling a lot of bad lenses while selecting good ones to hand to reviewers.
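As a back-of-the-envelope check on that kind of inference: if you assume each copy is independently "good" with probability p, the number of copies you have to try before getting a good one is geometrically distributed with expectation 1/p, so reports of needing on the order of ten tries suggest p is around 0.1. The independence assumption is doing a lot of work here (returned copies get resold, and most buyers don't test carefully), so treat this as a rough sanity check rather than a measurement.

```python
# Back-of-the-envelope check: if each copy is independently "good" with
# probability p, the count of copies tried until the first good one is
# geometric, with expectation 1/p. Needing ~10 tries in expectation is
# therefore consistent with only ~1 in 10 copies being good.
def expected_copies_tried(p_good: float) -> float:
    return 1.0 / p_good


for p in (0.5, 0.25, 0.1):
    print(f"P(good copy) = {p:.2f} -> expected copies tried = {expected_copies_tried(p):.1f}")
```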

Hard drives are yet another area where it's been documented that reviewers get copies of the item that aren't representative. Extreme Tech has reported, multiple times, that Adata, Crucial, and Western Digital have handed out review copies of SSDs that are not what you get as a consumer. One thing I find interesting about that case is that Extreme Tech says:

Agreeing to review a manufacturer’s product is an extension of trust on all sides. The manufacturer providing the sample is trusting that the review will be of good quality, thorough, and objective. The reviewer is trusting the manufacturer to provide a sample that accurately reflects the performance, power consumption, and overall design of the final product. When readers arrive to read a review, they are trusting that the reviewer in question has actually tested the hardware and that any benchmarks published were fairly run.

This makes it sound like the reviewer's job is to take a trusted sample handed to them by the vendor and then run good benchmarks, absolving the reviewer of the responsibility of obtaining representative devices and verifying that they're representative. I'm reminded of the SRE motto, "hope is not a strategy". Trusting vendors is not a strategy. We know that vendors will lie and cheat to look better in benchmarks. Saying that it's a vendor's fault for lying or cheating can shift the blame, but it won't result in reviews being accurate or useful to consumers.

We've only discussed a few specific areas where there's published evidence that reviews can't be trusted because they've been compromised by companies, but this isn't anything specific to those industries. As consumers, we should expect that any review that isn't performed by a trusted, independent agency that purchases its own review copies has been compromised and is not representative of the median consumer experience.

Another issue with reviews is that most online reviews that are highly ranked in search are really just SEO affiliate farms.

A more general issue is that reviews are also affected by the exact same problem as items that are not reviewed: people generally can't tell which reviews are actually good and which are not, so review sites are selected on things other than the quality of the review. A prime example of this is Wirecutter, which is so popular among tech folks that it's a tired joke to note how many tech apartments in SF are furnished with identical Wirecutter-recommended items. For people who haven't lived in SF, you can get a peek into the mindset by reading this post about how it's "impossible" to not buy the Wirecutter recommendation for anything, whose comments are full of people reassuring the poster that, due to the high value of the poster's time, it would be irresponsible to do anything else.

The thing I find funny about this is that, if you take benchmarking seriously (in any field) and just read the methodology for the median Wirecutter review, you can see, without even trying out the items reviewed, that the methodology is poor and that they'll generally select items that are mediocre and sometimes even worst in class. A thorough exploration of this really deserves its own post, but I'll cite one example of poorly reviewed items here: in https://benkuhn.net/vc, Ben Kuhn looked into how to create a nice video call experience, which included trying out a variety of microphones and webcams. Naturally, Ben tried Wirecutter's recommended microphone and webcam. The webcam was quite poor, no better than using the camera from an ancient 2014 iMac or his 2020 MacBook (and, to my eye, actually much worse; more on this later). And the microphone was roughly comparable to using the built-in microphone on his laptop.

I have a lot of experience with Wirecutter's recommended webcam because so many people have it, and it is shockingly bad in a distinctive way. Ben noted that, if you look at a still image, the white balance is terrible when the camera is used in the house he was in, and if you talk to other people who've used the camera, that's a common problem. But the issue I find worse is that, if you look at the video, under many conditions (and I think most, given how often I see this), the webcam will refocus regularly, making the entire video flash out of and then back into focus (another issue is that it often focuses on the wrong thing, but that's less common and I don't see it with everybody I talk to who uses Wirecutter's recommended webcam). I actually just had a call yesterday with a friend of mine who was using a different setup than the one I'd normally seen him with (the mediocre but perfectly acceptable MacBook webcam). His video was going in and out of focus every 10-30 seconds, so I asked him if he was using Wirecutter's recommended webcam, and of course he was; what other webcam with that problem would someone in tech buy?

This level of review quality is pretty typical for Wirecutter reviews and they appear to generally be the most respected and widely used review site among people in tech.

Appendix: capitalism

When I was in high school, there was a clique of proto-edgelords who did things like read The Bell Curve and argue its talking points to anyone who would listen.

One of their favorite topics was how the free market would naturally cause companies that make good products to rise to the top and companies that make poor products to disappear, resulting in things generally being safe, a good value, and so on and so forth. I still commonly see this opinion espoused by people working in tech, including people who fill their condos with Wirecutter-recommended items. I find the juxtaposition of people arguing that the market will generally result in products being good while they themselves buy overpriced garbage to be deliciously ironic. To be fair, it's not all overpriced garbage. Some of it is overpriced mediocrity and some of it is actually good; it's just that it's not too different from what you'd get if you just naively bought random stuff off of Amazon without reading third-party reviews.

For a related discussion, see this post on people who argue that markets eliminate discrimination even as they discriminate.

Appendix: other examples of the impact of measurement (or lack thereof)

Thanks to Fabian Giesen, Ben Kuhn, Yuri Vishnevsky, @chordowl, Seth Newman, Justin Blank, Per Vognsen, John Hergenroeder, Pam Wolf, Ivan Echevarria, and Jamie Brandon for comments/corrections/discussion.