We saw some really bad Intel CPU bugs in 2015 and we should expect to see more in the future

2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they've been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices¹.

Things haven't looked so great on the engineering/bugs side of things, though. We've seen a number of fairly serious CPU bugs and it looks like we should expect more in the future. I don't keep track of Intel bugs unless they're so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulic, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.

Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I'm actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. If, buried deep in our infrastructure, we have a bit of code running at DPC that's vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time².

Shortly after the Serebrin/Beulic bug was found, a group of people found that running prime95, a commonly used tool for benchmarking and burn-in, causes their entire system to lock up. Intel's response to this was:

Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.

which reveals almost nothing about what's actually going on. If you look at their errata list, you'll find that this is typical, except that they normally won't even name the application that was used to trigger the bug. For example, one of the current errata lists has entries like

Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior
AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior
Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity
Page Fault May Report Incorrect Fault Information

As we've seen, “unexpected system behavior” can mean that we're completely screwed. Machine checks aren't great either -- they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary sounding bugs.

And keep in mind that the Intel errata list has the following disclaimer:

Errata remain in the specification update throughout the product's lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.

Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won't be able to find out what errata your older stepping has unless you're important enough to Intel.

Anyway, back to 2015. We've seen at least two serious bugs in Intel CPUs in the last quarter³, and it's almost certain there are more bugs lurking. Back when I worked at a company that produced Intel compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who'd previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we'll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it's much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that. In part, that's because "unpredictable system behavior" have moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it's mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that's supposed to help us with security, but the virtualization is so complicated⁴ that the hardware virtualization implementation is likely to expose "unpredictable system behavior" bugs that wouldn't otherwise have existed. This isn't to say it's hopeless -- it's possible, in principle, to design CPUs such that a hang bug on one core doesn't crash the entire system. It's just that it's a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one's done the work because it hasn't previously seemed important.

You'll often hear software folks say that these things don't matter because they can (sometimes) be patched. But, many devices will never get patched, which means that hardware security bugs will leave some devices vulnerable for their entire lifetime. And even if you don't care about consumers, serious bugs are very bad for CPU vendors. At a company I worked for, we once had a bug escape validation and get found after we shipped. One OEM wouldn't talk to us for something like five years after that, and other OEMs that continued working with us had to re-qualify their parts with our microcode patch and they made sure to let us know how expensive that was. Intel has enough weight that OEMs can't just walk away from them after a bug, but they don't have unlimited political capital and every serious bug uses up political capital, even if it can be patched.

This isn't to say that we should try to get to zero bugs. There's always going to be a trade off between development speed and and bug rate and the optimal point probably isn't zero bugs. But we're now regularly seeing severe bugs with security implications, which changes the tradeoff a lot. With something like the FDIV bug you can argue that it's statistically unlikely that any particular user who doesn't run numerical analysis code will be impacted, but security bugs are different. Attackers don't run random code, so you can't just say that it's unlikely that some condition will occur.

Update

After writing this, a person claiming to be an ex-Intel employee said "even with your privileged access, you have no idea" and a pseudo-anonymous commenter on reddit made this comment:

As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It's late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: "We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times... we can't live forever in the shadow of the early 90's FDIV bug, we need to move on. Our competition is moving much faster than we are" - I'm paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn't explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn't seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

I haven't been able to confirm this story from another source I personally know, although another anonymous commenter said "I left INTC in mid 2013. From validation. This ... is accurate compared with my experience." Another anonymous person, someone I know, didn't hear that speech, but found that at around that time, "velocity" became a buzzword and management spent a lot of time talking about how Intel needs more "velocity" to compete with ARM, which appears to confirm the sentiment, if not the actual speech.

I've also heard from formal methods people that, around the-timeframe mentioned in the first comment, there was an exodus of formal verification folks. One story I've heard is that people left because they were worried about being made redundant. I'm told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I've heard is that things got really strange due to Intel's focus on the mobile battle with ARM, and people wanted to leave before things got even worse. But it's hard to say of this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.

I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they'd see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That's not better per-core performance -- that's better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.

There's also a mysterious hang during idle/low-activity bug that Intel doesn't seem to have figured out yet.

And then there's this Broadwell bug that hangs Linux if you don't disable low-power states.

And of course Intel isn't the only company with bugs -- this AMD bug found by Robert Swiecki not only allows a VM to crash its host, it also allows a VM to take over the host.

I doubt I've even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.

More updates

A number of folks have noticed unusual failure rates in storage devices and switches. This appears to be related to an Intel Atom bug. I find this interesting because the Atom is a relatively simple chip, and therefore a relatively simple chip to verify. When the first-gen Atom was released, folks at Intel seemed proud of how few internal spins of the chip were needed to ship a working production chip that, something made possible by the simplicity of the chip. Modern Atoms are more complicated, but not that much more complicated.

Intel Skylake and Kaby Lake have a hyperthreading bug that's so serious that Debian recommends that users disable hyperthreading to avoid the bug, which can "cause spurious errors, such as application and system misbehavior, data corruption, and data loss".

On the AMD side, there might be a bug that's as serious any recent Intel CPU bug. If you read that linked thread, you'll see an AMD representative asking people to disable SMT, OPCache Control, and changing LLC settings to possibly mitigate or narrow down a serious crashing bug. On another thread, you can find someone reporting an #MC exception with "u-op cache crc mismatch".

Although AMD's response in the forum was that these were isolated issues, phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs. They report they were able to get 53 segfaults with one hour of attempted compilation.

Some FreeBSD folks have also noticed seemingly unrelated crashes and have been able to get a reproduction by running code at a high address and then firing an interrupt. This can result in a hang or a crash. The reason this appears to be unrelated to the first reported Ryzen issues is that this is easily reproducible with SMT disabled.

Matt Dillon found an AMD bug triggered by DragonflyBSD, and commited a tiny patch to fix it:

There is a bug in Ryzen related to the kernel iretq'ing into a high user %rip address near the end of the user address space (top of user stack). This is a temporary workaround for the issue.

The original %rip for sigtramp was 0x00007fffffffffe0. Moving it down to fa0 wasn't sufficient. Moving it down to f00 moved the bug from nearly instant to taking a few hours to reproduce. Moving it down to be0 it took a day to reproduce. Moving it down to 0x00007ffffffffba0 (this commit) survived the overnight test.

Meltdown / spectre update

This is an interesting class of attack that takes advantage of speculative execution plus side channel attacks to leak privileged information into user processes. It seems that at least some of these attacks be done from javascript in the browser.

Regarding the comments in the first couple updates on Intel's attitude towards validation recently, another person claiming to be ex-Intel backs up the statements above:

As a former Intel employee this aligns closely with my experience. I didn't work in validation (actually joined as part of Altera) but velocity is an absolute buzzword and the senior management's approach to complex challenges is sheer panic. Slips in schedules are not tolerated at all - so problems in validation are an existential threat, your project can easily just be canned. Also, because of the size of the company the ways in which quality and completeness are 'acheived' is hugely bureaucratic and rarely reflect true engineering fundamentals.

2024 update

We're approaching a decade since I wrote this post and the serious CPU bugs keep coming. For example, this recent one was found by RAD tools:

Intel Processor Instability Causing Oodle Decompression Failures

We believe that this is a hardware problem which affects primarily Intel 13900K and 14900K processors, less likely 13700, 14700 and other related processors as well. Only a small fraction of those processors will exhibit this behavior. The problem seems to be caused by a combination of BIOS settings and the high clock rates and power usage of these processors, leading to system instability and unpredictable behavior under heavy load ... Any programs which heavily use the processor on many threads may cause crashes or unpredictable behavior. There have been crashes seen in RealBench, CineBench, Prime95, Handbrake, Visual Studio, and more. This problem can also show up as a GPU error message, such as spurious "out of video memory" errors, even though it is caused by the CPU.

One can argue that this is a configuration bug, but from the standpoint of a typical user, all what they observe is that their CPU is causing crashes. And, realistically, Intel knows that their CPUs are shipping into systems with these settings. The mitigation for this involves doing things like changing the following settings ""SVID behavior" → "Intel fail safe", "Long duration power limit" → reduce to 125W if set higher ("Processor Base Power" on ARK)", "Short duration power limit" → reduce to 253W if set higher (for 13900/14900 CPUs, other CPUs have other limits! "Maximum Turbo Power" on ARK)", etc.

If they wanted their CPUs to not crash due to this issue, they could have and should have enforced these settings as well as some others. Instead, they left this up to the BIOS settings, and here we are.

Historically, Intel was much more serious about verification, validation, and testing than AMD and we saw this in their output. At one point, when a lot of enthusiast sites were excited about AMD (in the K7 days), Google stopped using AMD and basically banned purchases of AMD CPUs because they were so buggy and had caused so many hard-to-debug problems. But, over time, the relative level of verification/validation/test effort Intel allocates has gone down and Intel seems to have nearly caught or maybe caught AMD in their rate of really serious bugs. Considering Intel's current market position, with very heavy pressure from AMD, ARM, and Nvidia, it seems unlikely that Intel will turn this around in the foreseeable future. Nvidia, historically, has been significantly buggier than AMD or Intel, so Intel still has quite a bit of room to run to become the most buggy major chip manufacturer. Considering that Nvidia is one of the biggest threats to Intel and how Intel responded to threats from other, then-buggier, manufacturers, it seems like we should expect an even higher rate of bad bugs in the coming decade.

On the specific bug, there's tremendous pressure to operate more like a "move fast and break things" software company than a traditional, conservative, CPU manufacturer for multiple reasons. When you make a manufacture a CPU, how fast it will run ends up being somewhat random and there's no reliable way to tell how fast it will run other than testing it, so CPU companies run a set of tests on the CPU to see how fast it will go. This test time is actually fairly expensive, so there's a lot of work done to try to find the smallest set of tests possible that will correctly determine how fast the CPU can operate. One easy way to cut costs here is to just run fewer tests even if the smaller set of tests doesn't fully guarantee that the CPU can operate at the speed it's sold at.

Another factor influencing this is that CPUs that are sold as nominally faster can sell for more, so there's also pressure to push the CPUs as close to their limits as possible. One way we can see that the margin here has, in general, decreased, is by looking at how overclockable CPUs are. People are often happy with their overclocked CPU if they run a few tests, like prime95, stresstest, etc., and their part doesn't crash, but this isn't nearly enough to determine if the CPU can really run everything a user could throw at it, but if you really try to seriously test a CPU (working at an Intel competitor, we would do this regularly), Intel and other CPU companies have really pushed the limit of how fast they claim their CPUs are relative to how fast they actually are, which sometimes results in CPUs that are sold that have been pushed beyond their capabilities.

On overclocking, as Fabian Giesen of RAD notes,

This stuff is not sanctioned and will count as overclocking if you try to RMA it but it's sold as a major feature of the platform and review sites test with it on.

Daniel Gibson replied with

hmm on my mainboard (ASUS ROG Strix B550-A Gaming -clearly gaming hardware, but middle price range) I had to explicitly enable the XMP/EXPO profile for the DDR4-RAM to run at full speed - which is DDR4-3200, officially supported by the CPU (Ryzen 5950X). Otherwise it ran at DDR4-2400 speed, I think? Or was it 2133? I forgot, at least significantly lower

To which Fabian noted

Correct. Fun fact: turning on EXPO technically voids your warranty ... t's great; both the CPU and the RAM list it as supported but it's officially not.

One might call it a racket, if one were inclined to such incisive language.

Intel didn't used to officially unofficially support this kind of thing. And, more generally, historically, CPU manufacturers were very hesitant to ship parts that had a non-negligible risk of crashes and data corruption when used as intended if they could avoid them, but more and more of these bugs keep happening. Some end up becoming quite public, like this, due to someone publishing a report about them like the RAD report above. And some get quietly reported to the CPU manufacturer by a huge company, often with some kind of NDA agreement, where the big company gets replacement CPUs and Intel or another manufacturer quietly ships firmware fixes to the issue. And it surely must be the case that some of these aren't really caught at all, unless you count the occasional data corruption or crash as being caught.

CPU internals series

Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, JonLuca De Caro, Jeff Fowler, and a number of anonymous tipsters for comments/corrections/discussion.

As with the Apple, Google, Adobe, etc., wage-fixing agreement, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they'll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that's applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren't caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who impulse commit crimes don't deeply study sentencing guidelines before they commit crimes. Ironically, for white-collar corporate crimes where Becker's theory might more plausibly hold, Becker's theory isn't applied. ^[return]
Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn't trivially mitigate the kind of unsophisticated attack that's been severely affecting Linode availability for weeks, they wouldn't be able to stay in business. If you talk to people at various bay area unicorns, you'll find that a lot of them have accidentally DoS'd themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? ^[return]
The fault into microcode infinite loop also affects AMD processors, but basically no one runs a cloud on AMD chips. I'm pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that's been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it's not clear how effective their verification groups can be since they've been losing experienced folks without being able to replace them for over a decade, but that's a topic for another post. ^[return]
See this code for a simple example of how to use Intel's API for this. The example is simplified, so much so that it's not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. ^[return]