The value of in-house expertise

An alternate title for this post might be, "Twitter has a kernel team!?". At this point, I've heard that surprised exclamation enough that I've lost count of the number times that's been said to me (I'd guess that it's more than ten but less than a hundred). If we look at trendy companies that are within a couple factors of two in size of Twitter (in terms of either market cap or number of engineers), they mostly don't have similar expertise, often as a result of path dependence — because they "grew up" in the cloud, they didn't need kernel expertise to keep the lights on the way an on prem company does. While that makes it socially understandable that people who've spent their career at younger, trendier, companies, are surprised by Twitter having a kernel team, I don't think there's a technical reason for the surprise.

Whether or not it has kernel expertise, a company Twitter's size is going to regularly run into kernel issues, from major production incidents to papercuts. Without a kernel team or the equivalent expertise, the company will muddle through the issues, running into unnecessary problems as well as taking an unnecessarily long time to mitigate incidents. As an example of a critical production incident, just because it's already been written up publicly, I'll cite this post, which dryly notes:

Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug

What this implies but doesn't explicitly say is that this firewall misconfiguration was the most severe incident that's occured during my time at Twitter and I believe it's actually the most severe outage that Twitter has had since 2013 or so. As a company, we would've still been able to mitigate the issue without a kernel team or another team with deep Linux expertise, but it would've taken longer to understand why the initial fix didn't work, which is the last thing you want when you're debugging a serious outage. Folks on the kernel team were already familiar with the various diagnostic tools and debugging techniques necessary to quickly understand why the initial fix didn't work, which is not common knowledge at some peer companies (I polled folks at a number of similar-scale peer companies to see if they thought they had at least one person with the knowledge necessary to quickly debug the bug and the answer was no at many companies).

Another reason to have in-house expertise in various areas is that they easily pay for themselves, which is a special case of the generic argument that large companies should be larger than most people expect because tiny percentage gains are worth a large amount in absolute dollars. If, in the lifetime of the specialist team like the kernel team, a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity, and Twitter’s kernel team has found many such changes. In addition to kernel patches that sometimes have that kind of impact, people will also find configuration issues, etc., that have that kind of impact.

So far, I've only talked about the kernel team because that's the one that most frequently elicits surprise from folks for merely existing, but I get similar reactions when people find out that Twitter has a bunch of ex-Sun JVM folks who worked on HotSpot, like Ramki Ramakrishna, Tony Printezis, and John Coomes. People wonder why a social media company would need such deep JVM expertise. As with the kernel team, companies our size that use the JVM run into weird issues and JVM bugs and it's helpful to have people with deep expertise to debug those kinds of issues. And, as with the kernel team, individual optimizations to the JVM can pay for the team in perpetuity. A concrete example is this patch by Flavio Brasil, which virtualizes compare and swap calls.

The context for this is that Twitter uses a lot of Scala. Despite a lot of claims otherwise, Scala uses more memory and is significantly slower than Java, which has a significant cost if you use Scala at scale, enough that it makes sense to do optimization work to reduce the performance gap between idiomatic Scala and idiomatic Java.

Before the patch, if you profiled our Scala code, you would've seen an unreasonably large amount of time spent in Future/Promise, including in cases where you might naively expect that the compiler would optimize the work away. One reason for this is that Futures use a compare-and-swap (CAS) operation that's opaque to JVM optimization. The patch linked above avoids CAS operations when the Future doesn't escape the scope of the method. This companion patch removes CAS operations in some places that are less amenable to compiler optimization. The two patches combined reduced the cost of typical major Twitter services using idiomatic Scala by 5% to 15%, paying for the JVM team in perpetuity many times over and that wasn't even the biggest win Flavio found that year.

I'm not going to do a team-by-team breakdown of teams that pay for themselves many times over because there are so many of them, even if I limit the scope to "teams that people are surprised that Twitter has".

A related topic is how people talk about "buy vs. build" discussions. I've seen a number of discussions where someone has argued for "buy" because that would obviate the need for expertise in the area. This can be true, but I've seen this argued for much more often than it is true. An example where I think this tends to be untrue is with distributed tracing. We've previously looked at some ways Twitter gets value out of tracing, which came out of the vision Rebecca Isaacs put into place. On the flip side, when I talk to people at peer companies with similar scale, most of them have not (yet?) succeeded at getting significant value from distributed tracing. This is so common that I see a viral Twitter thread about how useless distributed tracing is more than once a year. Even though we went with the more expensive "build" option, just off the top of my head, I can think of multiple uses of tracing that have returned between 10x and 100x the cost of building out tracing, whereas people at a number of companies that have chosen the cheaper "buy" option commonly complain that tracing isn't worth it.

Coincidentally, I was just talking about this exact topic to Pam Wolf, a civil engineering professor with experience in (civil engineering) industry on multiple continents, who had a related opinion. For large scale systems (projects), you need an in-house expert (owner's-side engineer) for each area that you don't handle in your own firm. While it's technically possible to hire yet another firm to be the expert, that's more expensive than developing or hiring in-house expertise and, in the long run, also more risky. That's pretty analogous to my experience working as an electrical engineer as well, where orgs that outsource functions to other companies without retaining an in-house expert pay a very high cost, and not just monetarily. They often ship sub-par designs with long delays on top of having high costs. "Buying" can and often does reduce the amount of expertise necessary, but it often doesn't remove the need for expertise.

This related to another common abstract argument that's commonly made, that companies should concentrate on "their area of comparative advantage" or "most important problems" or "core business need" and outsource everything else. We've already seen a couple of examples where this isn't true because, at a large enough scale, it's more profitable to have in-house expertise than not regardless of whether or not something is core to the business (one could argue that all of the things that are moved in-house are core to the business, but that would make the concept of coreness useless). Another reason this abstract advice is too simplistic is that businesses can somewhat arbitrarily choose what their comparative advantage is. A large¹ example of this would be Apple bringing CPU design in-house. Since acquiring PA Semi (formerly the team from SiByte and, before that, a team from DEC) for $278M, Apple has been producing the best chips in the phone and laptop power envelope by a pretty large margin. But, before the purchase, there was nothing about Apple that made the purchase inevitable, that made CPU design an inherent comparative advantage of Apple. But if a firm can pick an area and make it an area of comparative advantage, saying that the firm should choose to concentrate on its comparative advantage(s) isn't very helpful advice.

$278M is a lot of money in absolute terms, but as a fraction of Apple's resources, that was tiny and much smaller companies also have the capability to do cutting edge work by devoting a small fraction of their resources to it, e.g., Twitter, for a cost that any $100M company could afford, created novel cache algorithms and data structures and is doing other cutting edge cache work. Having great cache infra isn't any more core to Twitter's business than creating a great CPU is to Apple's, but it is a lever that Twitter can use to make more money than it could otherwise.

For small companies, it doesn't make sense to have in-house experts for everything the company touches, but companies don't have to get all that large before it starts making sense to have in-house expertise in their operating system, language runtime, and other components that people often think of as being fairly specialized. Looking back at Twitter's history, Yao Yue has noted that when she was working on cache in Twitter's early days (when we had ~100 engineers), she would regularly go to the kernel team for help debugging production incidents and that, in some cases, debugging could've easily taken 10x longer without help from the kernel team. Social media companies tend to have relatively high scale on a per-user and per-dollar basis, so not every company is going to need the same kind of expertise when they have 100 engineers, but there are going to be other areas that aren't obviously core business needs where expertise will pay off even for a startup that has 100 engineers.

Thanks to Ben Kuhn, Yao Yue, Pam Wolf, John Hergenroeder, Julien Kirch, Tom Brearley, and Kevin Burke for comments/corrections/discussion.

Some other large examples of this are Korean chaebols, like Hyundai. Looking at how Hyundai Group's companies are connected to Hyundai Motor Company isn't really the right lens with which to examine Hyundai, but I'm going to use that lens anyway since most readers of this blog are probably already familiar with Hyundai Motor and will not be familiar with how Korean chaebols operate.

Speaking very roughly, with many exceptions, American companies have tended to take the advice to specialize and concentrate on their competencies, at least since the 80s. This is the opposite of the direction that Korean chaebols have gone. Hyundai not only makes cars, they make the steel their cars use, the robots they use to automate production, the cement used for their factories, the construction equipment used to build their factories, the containers and ships used to ship cars (which they also operate), the transmissions for their cars, etc.

If we look at a particular component, say, their 8-speed transmission vs. the widely used and lauded ZF 8HP transmission, reviewers typically slightly prefer the ZF transmission. But even so, having good-enough in-house transmissions, as well as many other in-house components that companies would typically buy, doesn't exactly seem to be a disadvantage for Hyundai.
^[return]