2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they’ve been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices1.
Things haven’t looked so great on the engineering/bugs side of things, though. I don’t keep track of Intel bugs unless they’re so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulic, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.
Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I’m actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. If, buried deep in our infrastructure, we have a bit of code running at DPC that’s vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. And they’ll often wait until an inconvenient time to start the attack, such as Christmas, or one of the big online shopping days. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time2.
Shortly after the Serebrin/Beulic bug was found, a group of people found that running prime95, a commonly used tool for benchmarking and burn-in, causes their entire system to lock up. Intel’s response to this was:
Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.
which reveals almost nothing about what’s actually going on. If you look at their errata list, you’ll find that this is typical, except that they normally won’t even name the application that was used to trigger the bug. For example, one of the current errata lists has entries like
As we’ve seen, “unexpected system behavior” can mean that we’re completely screwed. Machine checks aren’t great either – they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary sounding bugs.
And keep in mind that the Intel errata list has the following disclaimer:
Errata remain in the specification update throughout the product’s lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.
Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won’t be able to find out what errata your older stepping has unless you’re important enough to Intel.
Anyway, back to 2015. We’ve seen at least two serious bugs in Intel CPUs in the last quarter3, and it’s almost certain there are more bugs lurking. Back when I worked at a company that produced Intel compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who’d previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we’ll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it’s much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we’ve moved past that. In part, that’s because “unpredictable system behavior” have moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it’s mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that’s supposed to help us with security, but the virtualization is so complicated4 that the hardware virtualization implementation is likely to expose “unpredictable system behavior” bugs that wouldn’t otherwise have existed. This isn’t to say it’s hopeless – it’s possible, in principle, to design CPUs such that a hang bug on one core doesn’t crash the entire system. It’s just that it’s a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one’s done the work because it hasn’t previously seemed important.
After writing this, an ex-Intel employee said “even with your privileged access, you have no idea” and a pseudo-anonymous commenter on reddit made this comment:
As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.
Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.
I haven’t been able to confirm this story from another source I personally know, although another anonymous commenter said “I left INTC in mid 2013. From validation. This … is accurate compared with my experience.” Another anonymous person, someone I know, didn’t hear that speech, but found that at around that time, “velocity” became a buzzword and management spent a lot of time talking about how Intel needs more “velocity” to compete with ARM, which appears to confirm the sentiment, if not the actual speech.
I’ve also heard from formal methods people that, around that time, there was an exodus of formal verification folks. One story I’ve heard is that people left because they were worried about being made redundant. I’m told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I’ve heard is that things got really strange due to Intel’s focus on the mobile battle with ARM, and people wanted to leave before things got even worse. But it’s hard to say of this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.
I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they’d see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That’s not better per-core performance – that’s better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.
There’s also a mysterious hang during idle/low-activity bug that Intel doesn’t seem to have figured out yet.
And then there’s this Broadwell bug that hangs Linux if you don’t disable low-power states.
And of course Intel isn’t the only company with bugs – this AMD bug found by Robert Swiecki not only allows a VM to crash its host, it also allows a VM to take over the host.
I doubt I’ve even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.
A number of folks have noticed unusual failure rates in storage devices and switches. This appears to be related to an Intel Atom bug. I find this interesting because the Atom is a relatively simple chip, and therefore a relatively simple chip to verify. When the first-gen Atom was released, folks at Intel seemed proud of how few internal spins of the chip were needed to ship a working production chip that, something made possible by the simplicity of the chip. Modern Atoms are more complicated, but not that much more complicated.
Intel Skylake and Kaby Lake have a hyperthreading bug that’s so serious that Debian recommends that users disable hyperthreading to avoid the bug, which can “cause spurious errors, such as application and system misbehavior, data corruption, and data loss”.
On the AMD side, there might be a bug that’s as serious any recent Intel CPU bug. If you read that linked thread, you’ll see an AMD representative asking people to disable SMT, OPCache Control, and changing LLC settings to possibly mitigate or narrow down a serious crashing bug. On another thread, you can find someone reporting an #MC exception with “u-op cache crc mismatch”.
Although AMD’s response in the forum was that these were isolated issues, phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs. They report they were able to get 53 segfaults with one hour of attempted compilation.
Some FreeBSD folks have also noticed seemingly unrelated crashes and have been able to get a reproduction by running code at a high address and then firing an interrupt. This can result in a hang or a crash. The reason this appears to be unrelated to the first reported Ryzen issues is that this is easily reproducible with SMT disabled.
Matt Dillon found an AMD bug triggered by DragonflyBSD, and commited a tiny patch to fix it:
There is a bug in Ryzen related to the kernel iretq’ing into a high user %rip address near the end of the user address space (top of user stack). This is a temporary workaround for the issue.
The original %rip for sigtramp was 0x00007fffffffffe0. Moving it down to fa0 wasn’t sufficient. Moving it down to f00 moved the bug from nearly instant to taking a few hours to reproduce. Moving it down to be0 it took a day to reproduce. Moving it down to 0x00007ffffffffba0 (this commit) survived the overnight test.
Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, and a number of anonymous tipsters for comments/corrections/discussion.