CPU backdoors

It’s generally accepted that any piece of software could be compromised with a backdoor. Prominent examples include the Sony/BMG installer, which had a backdoor built in to keep users from copying the CD, and which also allowed malicious third parties to take over any machine with the software installed; the Samsung Galaxy, which had a backdoor that allowed the modem to access the device’s filesystem, and which also allowed anyone running a fake base station to access files on the device; Lotus Notes, which had a backdoor that allowed encryption to be defeated; and Lenovo laptops, which pushed all web traffic through a proxy (including HTTPS, via a trusted root certificate) in order to push ads, and which allowed anyone with the correct key (which was distributed on every laptop) to intercept HTTPS traffic.

Despite sightings of backdoors in FPGAs and networking gear, whenever someone brings up the possibility of CPU backdoors, it’s still common for people to claim that it’s impossible. I’m not going to claim that CPU backdoors exist, but I will claim that the implementation is easy, if you’ve got the right access.

Let’s say you wanted to make a backdoor. How would you do it? There are three parts to this: what could a backdoored CPU do, how could the backdoor be accessed, and what kind of compromise would be required to install the backdoor?

Starting with the first item, what does the backdoor do? There are a lot of possibilities. The simplest is to allow privilege escalation: make the CPU transition from ring3 to ring0 or SMM, giving the running process kernel-level privileges. Since it’s the CPU that’s doing it, this can punch through both hardware and software virtualization. There are a lot of subtler or more invasive things you could do, but privilege escalation is both simple enough and powerful enough that I’m not going to discuss the other options.

Now that you know what you want the backdoor to do, how should it get triggered? Ideally, it will be something that no one will run across by accident, or even by brute force, while looking for backdoors. Even with that limitation, the state space of possible triggers is huge.

Let’s look at a particular instruction, fyl2x[1]. Under normal operation, it takes two floating point registers as input, giving you 2*80=160 bits to hide a trigger in. If you trigger the backdoor off of a specific pair of values, that’s probably safe against random discovery. If you’re really worried about someone stumbling across the backdoor by accident, or brute forcing a suspected backdoor, you can check more than the two normal input registers (after all, you’ve got control of the CPU).
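To get a feel for how large a 160-bit trigger space is, here’s a minimal Python sketch. The two magic constants and the function names are made up for illustration, and a real trigger would be a comparison in microcode or hardware rather than software.

```python
# Hypothetical 80-bit "magic" operands (20 hex digits each). A real backdoor
# would hard-wire its own values; these are placeholders.
MAGIC_ST0 = 0x1234_5678_9ABC_DEF0_1234
MAGIC_ST1 = 0x0FED_CBA9_8765_4321_0FED

OPERAND_BITS = 80
KEY_BITS = 2 * OPERAND_BITS  # two 80-bit floating point registers


def is_trigger(st0: int, st1: int) -> bool:
    """Return True only when both operands match the magic pair exactly."""
    return st0 == MAGIC_ST0 and st1 == MAGIC_ST1


def years_to_expect_hit(guesses_per_second: float) -> float:
    """Expected years of uniformly random guessing before one hit."""
    expected_guesses = 2 ** KEY_BITS
    seconds = expected_guesses / guesses_per_second
    return seconds / (365 * 24 * 3600)
```

Under these assumptions, even a billion random guesses per second gives an expected wait of more than 10^30 years, which is why accidental discovery isn’t a realistic concern for a trigger like this.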

This trigger is nice and simple, but the downside is that hitting the trigger probably requires executing native code since you’re unlikely to get Chrome or Firefox to emit an fyl2x instruction. You could try to work around that by triggering off an instruction you can easily get a JavaScript engine to emit (like an fadd). The problem with that is that if you patch an add instruction and add some checks to it, it will become noticeably slower (although, if you can edit the hardware, you should be able to do it with no overhead). It might be possible to create something hard to detect that’s triggerable through JavaScript by patching a rep string instruction and doing some stuff to set up the appropriate “key” followed by a block copy, or maybe idiv. Alternately, if you’ve managed to get a copy of the design, you can probably figure out a way to use debug logic triggers[2] or performance counters to set off a backdoor when some arbitrary JavaScript gets run.

Alright, now you’ve got a backdoor. How do you insert the backdoor? In software, you’d either edit the source or the binary. In hardware, if you have access to the source, you can edit it as easily as you can in software. The hardware equivalent of recompiling the source, creating physical chips, has tremendously high fixed costs; if you’re trying to get your changes into the source, you’ll want to either compromise the design[3] and insert your edits before everything is sent off to get manufactured, or compromise the manufacturing process and sneak in your edits at the last second[4].

If that sounds too hard, you could try compromising the patch mechanism. Most modern CPUs come with a built-in patch mechanism to allow bug fixes after the fact. It’s likely that the CPU you’re using has been patched, possibly from day one, and possibly as part of a firmware update. The details of the patch mechanism for your CPU are a closely guarded secret. It’s likely that the CPU has a public key etched into it, and that it will only accept a patch that’s been signed by the right private key.
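Here’s a toy sketch of the “only accept a signed patch” check described above, assuming a textbook RSA signature over a SHA-256 digest. The actual mechanisms are secret, so the key, its size, the function names, and the scheme itself are all assumptions; unpadded RSA with a key this small isn’t secure and is only here to show the shape of the check. It needs Python 3.8 or later for the modular inverse.

```python
import hashlib

# Toy key built from two known Mersenne primes. Real key sizes and signature
# formats are not public; everything here is an illustrative assumption.
P = 2**89 - 1
Q = 2**107 - 1
N = P * Q                          # public modulus (the part "etched into" the CPU)
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, held only by the vendor


def digest_as_int(patch: bytes) -> int:
    """Hash the patch and reduce the digest into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(patch).digest(), "big") % N


def sign_patch(patch: bytes) -> int:
    """Vendor side: sign the patch digest with the private exponent."""
    return pow(digest_as_int(patch), D, N)


def cpu_accepts(patch: bytes, signature: int) -> bool:
    """CPU side: accept the patch only if the signature matches its digest."""
    return pow(signature, E, N) == digest_as_int(patch)
```

In this model, a modified patch is rejected unless whoever modified it also has the private exponent, which is why compromising the patch mechanism comes down to getting the signing key (or finding a CPU that skips the check).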

Is this actually happening? I have no idea. Could it be happening? Absolutely. What are the odds? Well, the primary challenge is non-technical, so I’m not the right person to ask about that. If I had to guess, I’d say no, if for no other reason than the ease of subverting other equipment.

I haven’t discussed how to make a backdoor that’s hard to detect even if someone has access to software you’ve used to trigger a backdoor. That’s harder, but it should be possible once chips start coming with built-in TPMs.

If you liked this post, you’ll probably enjoy this post on CPU bugs and might be interested in this post about new CPU features over the past 35 years.

Updates

See this twitter thread for much more discussion, some of which is summarized below.

I’m not going to provide individual attributions because there are too many comments, but here’s a summary of comments from @hackerfantastic, Arrigo Triulzi, David Kanter, @solardiz, @4Dgifts, Alfredo Ortega, Marsh Ray, and Russ Cox. Mistakes are my own, of course.

AMD’s K7 and K8 had their microcode patch mechanisms compromised, allowing for the sort of attacks mentioned in this post. Turns out, AMD didn’t encrypt updates or validate them with a checksum, which lets you easily modify updates until you get one that does what you want.

Here’s an example of a backdoor that was created for demonstration purposes, by Alfredo Ortega.

For folks without a hardware background, this talk on how to implement a CPU in VHDL is nice, and it has a section on how to implement a backdoor.

Is it possible to backdoor RDRAND by providing bad random results? Yes. I mentioned that in my first draft of this post, but I got rid of it since my impression was that people don’t trust RDRAND and mix the results with other sources of entropy. That doesn’t make a backdoor useless, but it significantly reduces the value.
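To make the “mix the results with other sources” point concrete, here’s a minimal sketch that hashes several entropy inputs together. The function name and the choice of SHA-256 are assumptions for illustration; this isn’t how any particular OS or library does it.

```python
import hashlib


def mix_entropy(*sources: bytes) -> bytes:
    """Hash several entropy inputs into one 32-byte seed."""
    h = hashlib.sha256()
    for chunk in sources:
        # Length-prefix each input so that different ways of splitting the
        # same bytes across sources don't produce the same hash input.
        h.update(len(chunk).to_bytes(8, "big"))
        h.update(chunk)
    return h.digest()
```

If the hardware source is backdoored (say it always returns bytes the attacker knows), the output of something like `mix_entropy(hw_bytes, os.urandom(32))` still depends on the second input, so the attacker has to predict that too.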

Would it be possible to store and dump AES-NI keys? It’s probably infeasible to sneak flash memory onto a chip without anyone noticing, but modern chips have logic analyzer facilities that let you store and dump data. However, access to those is through some secret mechanism and it’s not clear how you’d even get access to binaries that would let you reverse engineer their operation. That’s in stark contrast to the K8 reverse engineering, which was possible because microcode patches get included in firmware updates.

It would be possible to check instruction prefixes for the trigger. x86 lets you put redundant (and contradictory) instruction prefixes on instructions. Which prefixes get used is well defined, so you can add as many prefixes as you want without causing problems (up to the prefix length limit). The issues with this are that it’s probably hard to do without sacrificing performance with a microcode patch, that the limited number of prefixes and the length limit mean that your effective key size is relatively small if you don’t track state across multiple instructions, and that you can only generate the trigger with native code.
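For a rough sense of how small that key is, here’s a back-of-the-envelope sketch. It assumes the 11 legacy x86 prefix bytes and the 15-byte instruction length limit, and it optimistically assumes the trigger logic could distinguish every ordering of prefix bytes, so treat the result as an upper bound. The function name is made up.

```python
import math

# Legacy x86 prefix bytes: LOCK, REPNE, REP, six segment overrides,
# operand-size override, and address-size override.
LEGACY_PREFIXES = (0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E,
                   0x26, 0x64, 0x65, 0x66, 0x67)
MAX_INSTRUCTION_BYTES = 15


def prefix_key_bits(bytes_without_prefixes: int) -> float:
    """Upper bound on key bits from filling every spare byte with a prefix."""
    slots = max(0, MAX_INSTRUCTION_BYTES - bytes_without_prefixes)
    return slots * math.log2(len(LEGACY_PREFIXES))
```

For an instruction that takes two bytes without prefixes, `prefix_key_bits(2)` comes out to about 45 bits, compared to the 160 bits available from two floating point registers.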

As far as anyone knows, this is all speculative, and no one has seen an actual CPU backdoor being used in the wild.

Acknowledgments

Thanks to Leah Hanson for extensive comments, to Aleksey Shipilev and Joe Wilder for suggestions/corrections, and to the many participants in the twitter discussion linked to above. Also, thanks to Markus Siemens for noticing that a bug in some RSS readers was causing problems, and for providing the workaround. That’s not really specific to this post, but it happened to come up here.


  1. This choice of instruction is somewhat, but not completely, arbitrary. You’ll probably want an instruction that’s both slow and microcoded, to make it easy to patch with a microcode patch without causing a huge performance hit. The rest of this footnote is about what it means for an instruction to be microcoded. It’s quite long and not in the critical path of this post, so you might want to skip it.

    The distinction between a microcoded instruction and one that’s implemented in hardware is, itself, somewhat arbitrary. CPUs have an instruction set they implement, which you can think of as a public API. Internally, they can execute a different instruction set, which you can think of as a private API.

    On modern Intel chips, instructions that turn into four (or fewer) uops (private API calls) are translated into uops directly by the decoder. Instructions that result in more uops (anywhere from five to hundreds or possibly thousands) are decoded via a microcode engine that reads uops out of a small ROM or RAM on the CPU. Why four and not five? That’s a result of some tradeoffs, not some fundamental truth. The terminology for this isn’t standardized, but the folks I know would say that an instruction is “microcoded” if its decode is handled by the microcode engine and that it’s “implemented in hardware” if its decode is handled by the standard decoder. The microcode engine is sort of its own CPU, since it has to be able to handle things like reading and writing from temporary registers that aren’t architecturally visible, reading and writing from internal RAM for instructions that need more than just a few registers of scratch space, conditional microcode branches that change which microcode the microcode engine fetches and decodes, etc.

    Implementation details vary (and tend to be secret). But whatever the implementation, you can think of the microcode engine as something that loads a RAM with microcode when the CPU starts up, which then fetches and decodes microcoded instructions out of that RAM. It’s easy to modify what microcode gets executed by changing what gets loaded on boot via a microcode patch.

    For quicker turnaround while debugging, it’s somewhere between plausible and likely that Intel also has a mechanism that lets them force non-microcoded instructions to execute out of the microcode RAM in order to allow them to be patched with a microcode patch. But even if that’s not the case, compromising the microcode patch mechanism and modifying a single microcoded instruction should be sufficient to install a backdoor.

  2. For the most part, these aren’t publicly documented, but you can get a high-level overview of what kind of debug triggers Intel was building into their chips a couple generations ago starting at page 128 of Intel Technology Journal, Volume 4, Issue 3.
  3. For the past couple years, there’s been a debate over whether or not major corporations have been compromised and whether such a thing is even possible. During the cold war, government agencies on all sides were compromised at various levels for extended periods of time, despite having access to countermeasures not available to any corporations today (not hiring citizens of foreign countries, “enhanced interrogation techniques”, etc.). I’m not sure that we’ll ever know if companies are being compromised, but it would certainly be easier to compromise a present-day corporation than it was to compromise government agencies during the cold war, and that was eminently doable. Compromising a company enough to get the key to the microcode patch is trivial compared to what was done during the cold war.
  4. This is another really long footnote about minutiae! In particular, it’s about the manufacturing process. You might want to skip it! If you don’t, don’t say I didn’t warn you.

    It turns out that editing chips before manufacturing is fully complete is relatively easy, by design. To explain why, we’ll have to look at how chips are made.

    Cross section of Intel chip, 22nm process

    When you look at a cross-section of a chip, you see that silicon gates are at the bottom, forming logical primitives like NAND gates, with a series of metal layers above (labeled M1 through M8), forming wires that connect different gates. A cartoon model of the manufacturing process is that chips are built from the bottom up, one layer at a time, where each layer is created by depositing some material and then etching part of it away using a mask, in a process that’s analogous to lithographic printing. The non-cartoon version involves a lot of complexity: Todd Fernendez estimates that it takes about 500 steps to create the layers below “M1”. Additionally, the level of precision needed is high enough that the light used to etch wears out the equipment. You probably don’t normally think about lenses wearing out due to light passing through them, but at the level of precision required for each of the hundreds of steps required to make a transistor, it’s a serious problem. If that sounds surprising to you, you’re not alone. An ITRS roadmap from the 90s predicted that by 2016, we’d be at almost 30GHz (higher is better) on a 9nm process (smaller is better), with chips consuming almost 300 watts. Instead, 5GHz is considered pretty fast, and anyone who isn’t Intel will be lucky to get high-yield production on a 14nm process by the start of 2016. Making chips is harder than anyone guessed it would be.

    A modern chip has enough layers that it takes about three months to make one, from start to finish. This makes bugs very bad news since a bug fix that requires a change to one of the bottom layers takes three months to manufacture. In order to reduce the turnaround time on bug fixes, it’s typical to scatter unused logic gates around the silicon, to allow small bug fixes to be done with an edit to a few layers that are near the top. Since chips are made in a manufacturing line process, at any point in time, there are batches of partially complete chips. If you only need to edit one of the top metal layers, you can apply the edit to a partially finished chip, cutting the turnaround time down from months to weeks.

    Since chips are designed to allow easy edits, someone with access to the design before the chip is manufactured (such as the manufacturer) can make major changes with relatively small edits. I suspect that if you were to make this comment to anyone at a major CPU company, they’d tell you it’s impossible to do this without them noticing because it would get caught in characterization or when they were trying to find speed paths or something similar. One would hope, but actual hardware devices have shipped with backdoors, and either no one noticed, or they were complicit.
