The complexity of knowledge and skill transfer

Software teams full of kids who are just out of school (or just dropped out of school) regularly produce valuable companies. Why should microprocessors be any different? You never hear about a new team successfully making a high-performance microprocessor. Sure, PA Semi's acquisition by Apple was a moderately successful exit, but where did that team come from? They were the SiByte team, which left after SiByte was acquired by Broadcom, and SiByte was composed of many people from DEC who had been working together for over a decade. My old company was similar: an IBM fellow who was a very early Dell employee and then exec (back when Dell still did interesting design work) collected the best people he had worked with at IBM, then split off to create a chip startup. A hardware team where most of the people are smart new grads usually spends on the order of $100 million over five or six years only to find that they don't have a competitive product (or, more likely, don't even have anything that's close to working)1.

“Smart and gets things done” has become the standard for software hiring, but that isn't even enough for plumbing or carpentry. Next time you have a plumbing emergency, let me know how you pick a plumber. Do you hire the same way you do for software, taking the smart kid who's read a few books, tried out some tools at Home Depot, and is a great hacker? Or do you go with the grizzled veteran with decades of experience?

Physical work isn't the kind of thing you can derive from first principles, no matter how smart you are. Consider South Korea after WWII. Its GDP per capita was lower than that of Ghana or Kenya, and just barely above that of the Congo. For various reasons, the new regime didn't have to deal with legacy institutions, and they wanted Korea to become a first-world nation.

The story I've heard is that the government started by subsidizing concrete. After many years making concrete, they wanted to move up the chain and start more complex manufacturing. They eventually got to building ships, because shipping was a critical part of the export economy they wanted to create.

They pulled some of their best business people who had learned skills like management and operations in other manufacturing. Those people knew they didn't have the expertise to build ships themselves, so they contracted it out. They made the choice to work with Scottish firms, because Scotland has a long history of shipbuilding. Makes sense, right?

It didn't work. For historical and geographic reasons, Scotland's shipyards weren't full-sized; they built their ships in two halves and then assembled them. Worked fine for them, because they'd been doing it at scale since the 1800s, and had world-renowned expertise by the 1900s. But when the unpracticed Koreans tried to build ships using Scottish plans and detailed step-by-step directions, the result was two ship halves that didn't quite fit together and sank when assembled.

The Koreans eventually managed to start a shipbuilding industry by hiring foreign companies to come and build ships locally, showing people how it's done. And it took decades to get what we would consider basic manufacturing working smoothly, even though one might think that all of the requisite knowledge existed in books, was taught in university courses, and could be had from experts for a small fee. Now, their manufacturing industries are world class, e.g., Hyundai and Kia are up there with Toyota for producing the world's most reliable cars. Going from producing unreliable econoboxes to the most reliable cars you can buy took over a decade, just like it did for Toyota when they did it decades earlier. If there's a shortcut to quality other than hiring a gaggle of people who've done it before, no one's discovered it yet.

Today, anyone with a CS 101 background can take Geoffrey Hinton's course on neural networks and deep learning, and start applying state of the art machine learning techniques in production within a couple months. In software land, you can fix minor bugs in real time. If it takes a whole day to run your regression test suite, you consider yourself lucky because it means you're in one of the few environments that takes testing seriously. If the architecture is fundamentally flawed, you pull out your copy of Feathers' “Working Effectively with Legacy Code” and you apply minor fixes until you're done.

This isn't to say that software isn't hard, it's just a different kind of hard: the sort of hard that can be attacked with genius and perseverance, even without experience. But, if you want to build a ship, and you "only" have a decade of experience with carpentry, milling, metalworking, etc., well, good luck. You're going to need it. With a large ship, “minor” fixes can take days or weeks, and a fundamental flaw means that your ship sinks and you've lost half a year of work and tens of millions of dollars. By the time you get to something with the complexity of a modern high-performance microprocessor, a minor bug discovered in production costs three months and five million dollars. A fundamental flaw in the architecture will cost you five years and hundreds of millions of dollars2.

Physical mistakes are costly. There's no undo and editing isn't simply a matter of pressing some keys; changes consume real, physical resources. You need enough wisdom and experience to avoid common mistakes entirely – especially the ones that can't be fixed.

In retrospect, I think that I was too optimistic about software in this post. If we're talking about product-market fit and success, I don't think the attitude in the post is wrong and people with little to no experience often do create hits. But now that I've been in the industry for a while and talked to numerous people about infra at various startups as well as large companies, I think creating high quality software infra requires no less experience than creating high quality physical items. Companies that decided this wasn't the case and hired a bunch of smart folks from top schools to build their infra have ended up with low quality, unreliable, expensive, and difficult to operate infrastructure. It just turns out that, if you have very good product-market fit, your company can survive and even thrive while having infra that has two 9s of uptime and costs an order of magnitude more than your competitor's infra. You'll make less money than you would've otherwise, but the high order bits are all on the product side.

This turns out to be true even when selling infra products, e.g., looking at the success of Mongo when compared to companies founded around that time that focused on correctness. Despite very good evidence that Mongo had serious technical issues, simply claiming that they didn't and repeating that claim loudly gave them 90% of the value of working on correctness at a much lower cost, allowing them to focus on things that users actually care about, like ease of onboarding.

  1. Comparing my old company to another x86 startup founded within the year is instructive. Both started at around the same time. Both had great teams of smart people. Our competitor even had famous software and business people on their side. But it's notable that their hardware implementers weren't a core team of multi-decade industry veterans who had worked together before. It took us about two years to get a working x86 chip, on top of $15M in funding. Our goal was to produce a low-cost chip and we nailed it. It took them five years, with over $250M in funding. Their original goal was to produce a high performance low-power processor, but they missed their performance target so badly that they were forced into the low-cost space. They ended up with worse performance than us, with a chip that was 50% bigger (and hence, cost more than 50% more to produce), using a team four times our size. They eventually went under, because there's no way they could survive with 4x our burn rate and weaker performance. But, not before burning through $969M in funding (including $230M from patent lawsuits).
  2. A funny side effect of the importance of experience is that age discrimination doesn't affect the areas I've worked in. At 30, I'm bizarrely young for someone who's done microprocessor design. The core folks at my old place were in their 60s. They'd picked up some younger folks along the way, but 30? Freakishly young. People are much younger at the new gig: I'm surrounded by ex-supercomputer folks from Cray and SGI, who are barely pushing 50, along with a couple kids from Synplify and DESRES who, at 40, are unusually young. Not all hardware folks are that old. In another arm of the company, there are folks who grew up in the FPGA world, which is a lot more forgiving. In that group, I think I met someone who's only a few years older than me. Kidding aside, you'll see younger folks doing RTL design on complex projects at large companies that are willing to spend a decade mentoring folks. But, at startups and on small hardware teams that move fast, it's rare to hire someone into design who doesn't have a decade of experience.

    There's a crowd that's even younger than the FPGA folks, even younger than me, working on Arduinos and microcontrollers, doing hobbyist electronics and consumer products. I'm genuinely curious how many of those folks will decide to work on large-scale systems design. In one sense, it's inevitable, as the area matures, and solutions become more complex. The other sense is what I'm curious about: will the hardware renaissance spark an interest in supercomputers, microprocessors, and warehouse-scale computers?