End of May I had the opportunity to present my research on cache side channel attacks at the “Hack In The Box” conference. After my presentation with Nishat Herath last year at black hat I published my private comments to that slide deck and that was well received. I had decided to do that again for “Hack In The Box”, unfortunately it took me a little longer to translate my comments into something human readable. But here they are. Since the comments relate directly to a specific slide in the slide deck you’ll probably want to have the slide deck open when reading this blog post. You can find them here: https://conference.hitb.org/hitbsecconf2016ams/materials/D2T1%20-%20Anders%20Fogh%20-%20Cache%20Side%20Channel%20Attacks.pdf
Cache side channel attacks: CPU Design as a security problem
The title of the presentation took quite a while to figure out because I wanted a title that fairly and accurately described the content of the presentation, but one that was also slightly catchy. I like what I came up with.
Here I told how I got into cache side channel attacks as an introduction. I was working on a talk about row hammer detection, when Daniel Gruss tweeted he’d been able to row hammer in Java script. Consequently, I had to figure out how he did it, so that I could speak with confidence. Everything pointed towards the cache side channel attack literature, so I dug in and 4 hours later produced my first cache side channel attack. This side channel attack triggered my row hammer detection and that let me into a year worth of playing with CPU caches.
When I start doing slides, I always think about what I want to come across to the listener well knowing that most people will have forgotten most two hours later.
1) I wanted to point out that safe software running on unsafe hardware is unsafe – something I think is too often forgotten. Further I wanted to make the point that we are not talking about a “bug” in the sense that an intel engineer made a mistake. Rather the cache system generally works as designed – though this design has unintended consequences: cache side channel attacks. Specifically:
- Attacker can manipulate the cache: Shared L3 (between cores & trust zones) + Inclusive cache hierarchy
- The cache state can be queried and give back a lot of valuable information: Timing attacks (on access & clflush), Shared L3, N-Way-Set associative cache, complex addressing function for L3 that application access to sets.
2) Here I just wanted to point out that this isn’t just an academic issue. In many ways cache side channel attacks, can be a real world alternative to VM breakout and privilege escalation attacks – or even a method to support either of those.
3) I wanted to make the point that we can do something about these critters – even when the attacker adopts.
- The most important things I wanted to get across was CSCs are an attack methodology and not an attack. Further that the cache design is not a requirement to be an x86, that is the cache could be designed differently and be less useful for cache side channel attacks without causing software incompatibility.
- The bandwidth of the side channel is essentially the bandwidth of the cache and thus a lot of valuable data can be leaked. The size of the caches makes the retention time of the information high and thus allows us to actually read out the data in most cases (say compare to D’Atoine’s attack on instruction reordering where the information is lost immediately if not queried). Finally, software cannot really avoid using the cache and thus leak information.
- Here I wanted to point out that intel is unlikely to provide a fix. They won’t replace old processors, micro code update is unlikely and Intel have brought out CPU’s for a while without fixing. Leading us towards software fixes for a hardware problem. This is a situation which is always suboptimal, but short to medium term the only option. It was my intention to show that while suboptimal entirely within the realm of the possible.
Litterature: ARM: , TLB, RowBuffer
L3 is not small
Point here was to give a sense of scale on the importance of the L3 cache in the CPU. Also it’s kind of neat you can actually see slices of the L3 cache in a die shot.
Why is this interesting
The point of this slide was to illustrate the flexibility and power of CSC’s. Important to notice here: CSC’s are always attack on software implementation – not algorithms. An RSA implementation might leak it’s private key, another implementation may not.
Literature: Covert channels , RSA, AES, ECDSA,Keyboard,Mouse, KASRL
How the data cache works on Intel + Important cache features
Here I wanted to introduce the design features that allows the attacker to manipulate the cache hierarchy. The second slide is just a summary slide.
How memory is stored in L3 + N-Way Set Associative L3
Here I wanted to introduce how the cache stores data. This is further back ground for manipulation, but also serves as a back ground for getting information from the cache. The “intel’s problem” is a so called fully associative cache on bytes. Essentially cache line concept limits accuracy of the attacks and the N-Way set associative cache allows an attacker to work on much smaller units making attacks practical. Further it allows an attacker to deduct information from cache-set congruence, something that is supported by the complex addressing function for the cache. I avoided the complex addressing function deliberately to avoid complexity and just mentioned that “the colors are not clumped together, but spread out”. Cache side channel attacks has historically only approximated this complex addressing function by doing a simple timing attack – loading N+1 addresses from the same set will cause one not to be cached and thus accessing it will be slow. The complex addressing function for cache sets has been reverse engineered resonantly for most intel CPU’s.
Example code to attack
The example code I use to explain the attacks is inspired by a function in the GDK library which is significantly more complex than this. I wanted to underline that code as well as data can be attacked and that branches matter. I wanted to speak about the big 3 attacks in the following slides by highlight some differences in a concrete example. The GDK library was first attacked by .
Common for all CSC
Here the goal was to introduce timing attack and all the cache manipulation primitives of the big 3 caches. Evict is just accessing enough cache lines congruent to a cache set to cause all previously stored information to be evict. Prime is a variation on Evict – where you carefully access cache lines congruent to a cache set in a way that the attacker knows what is in the cache set. The flush primitive is simply the clflush instruction, that removes a specific cache line from the cache hierarchy.
Big 3 Cache side channel Attacks
I wanted to comment that the 3 attacks and the variants of them make up most CSC’s and that they are all fairly similar in structure. Also I wanted to point back to the primitives described in the previous slide.
Literature: Evict+Time, Prime+Probe, Flush+Reload
“The 4 slides on the attacks”
In these slides I wanted to highlight some advantages and disadvantages for each attack. The Evict+Time does not give any temporal resolution – which I call “post mortem analysis”, but you don’t need to worry about synchronizing with victim. Synchronizing can be done by any means including a separated CSC, a function call or any other information leak. Or even spying constantly. Though the accuracy is cache congruence it should probably be noted that prime and probe is disturbed by any cache set congruent memory operation whereas Evict+Time is only disturbed by those who evict exactly those cache lines used by the function. However, calling a function tends to bring more non-relevant code into play and thus noise. Noise is a more complex topic that the impression the overview slide gives.
I heard the „nobody shares memory“comment one too many times and wanted to raise a flag that it isn’t that uncommon. Finally, I wanted to warn against shared memory – particularly as it’s introduced through deduplication as that’s the most common vector in VM environments. Flush+Reload is just an example of the problems with dedublication, but one can can pile on with D’Antoine’s instruction reordering attack, dedublication attacks, more speculatively row hammer can become a practical cross VM vector with deduplication, row buffer side channel attacks etc. etc.
I wanted to point out that CSC’s are noisy. Usually the noise is due to contention with irrelevant code running on multicore CPU’s or contention in other subsystem than the cache. Also the hardware prefetcher can destroy things for an attacker. Of these effects only the effect of the hw prefetcher cannot be eliminated by repeating the attack – though obviously not all attacks lend themselves to be repeated (you cannot ask a user for his password 10k times). Sometimes you get good results in the first attempt. I had an Evict+Time attack that required more than 10k attempts to give satisfying results.
My „agenda“on my blackhat talk last year was to communicate that performance Counters can be an important tool for security related weirdness in the CPU. Part of this slide is an attempt to repeat that message. The important part was to introduce performance counters, performance counter interrupt and setting up the counter as they form an important part of my work on detecting cache side channel attacks.
Flush + Reload code
Here the Clflush is used as a “manipulate cache to a known state primitive”.
The Mov instruction (which could be replaced by most memory access instructions) is the reload phase. The RDTSC(P) does the actual timing of the reload phase and the mfence instructions prevents that the CPU reorder the instructions to do the reload phase outside the rdtsc(p) bracket.
My comments read that I should explain that cache side channel attacks like flush+reload is in a race condition with the process being spied upon.–Say if we’re attacking keyboard input we’ll be less visible if we wait a few milliseconds per iteration because nobody types that fast whereas for crypto we almost always need much higher temporal resolution and usually wouldn’t wait at all.
My original suggestion was to see how long a given count of cache misses would take. If too fast we had a cache side channel attack.  and  improved on that. All works fairly well.
Row hammer was flush + reload
Just noting here that if we remove the information acquisition and do it for two addresses at once we have the original row hammer code. It’s then easy to see that row hammer was a flush+reload attack. The “was” word was carefully chosen. Others has shown that the movnti instruction is a vector for row hammer too, and that vector is not a row hammer related attack. To round off my introduction I hope I mentioned that rowhammer.js was a flush+reload variation that I (and others) usually call Evict+Reload using the eviction primitive I discussed in a previous slide.
Flush + Flush
The back story here is I’d figured that clflush would leak information about the cache state and when approached by Daniel Gruss and Clementiné Maurice about detecting a cache attack that causes less cache misses I immediately knew, what they were talking about. Instead of competing to publish I did not do more work in this direction. I did get to review their wonderful paper though.
Flush+flush works with replacing the mov instruction in flush+reload with a clflush but is otherwise identical. The clflush instruction is probably faster when the cache line parameter isn’t in the cache because it’s able to shortcut actually flushing the cache.
Flush+flush has an advantage beyond stealthiness: clflush is faster than mov on uncached memory. Also it leaves the cache in a known state which means the first line of code can be skipped when iterating the attack. This attack is probably the fastest CSC. Also the clflush instruction introduces less problems with the hardware prefetchers. Literature: 
Why is flush+flush Stealth
Clflush does not cause cache misses! However, the victim still incurs cache misses due to flush and reload. This usually isn’t sufficient for the flush+reload detection I outline in previous slides to get decent detection rates without incurring significant amount of false positives.
The first point is an assumption that we’re not talking about a cross VM attack. My opinion is that cross VM flush+flush attacks should be foiled by not using dedublication. It’s also an assumption that I need. In a cross VM attack the attacker could use ring 0 to attack and thus bypass the CR4.TSD access violation. However, it is worth noting that even in this scenario it would make flush+flush more difficult.
The other 3 points are just the technology I use to catch flush+flush with.
This is actually a revamped version of my “Detecting flush+flush” blog post. After posting that I had some private conversations where my method got criticized for being too costly performance wise. So I first tried to find a better performance counter. There was an event that actually mentioned CLFLUSH in the offcore counters, unfortunately it was only available on a small subset of microarchitectures that I deemed too small to be worthwhile. I actually tried to see if intel just changed documentation and it appears they really revamped that event for other purposes. Then I played around with the CBO events in the cache system, though I could get better than cache-ref it was at the cost of being gameable from an attacker view point. So instead I decided to make a two stage approach – first detect the bracket and then detect clflush instruction. I had two reasons for this approach. The first is to deal with the performance impact of the second stage which can be quite severe. The second is I have a feeling that we’ll see other instruction-timing attacks in the months to come and this two stage model maps well to defending this kind of problem. The reason why this two stage works better is that the RDTSC instruction itself is rare, but in pairs spaced close enough for an attacker to not drown in noise is so rare that I’ve not seen a single benign application causing this in my testing.
Using CR4.TSD to make rdtsc privileged affects performance two fold. First it causes a very significant slowdown on the RDTSC instruction with emulating it in an exception handler. However, the RDTSC is rarely used – in particularly rarely used in a fashion where it makes up a significant part of the total execution time and thus I think the performance penalty is acceptable. Much worse is that the RDTSC instruction becomes less accurate which might cause issues. Short of profiling code, I’ve yet to come up with a benign case where such accuracy would be required. I may be wrong though.
The detection obviously hinges strongly around the RDTSC being used as a timer. I’ve yet to come up with a good alternative to rdtsc, but that doesn’t mean it doesn’t exists.
The Gory details left out is mostly eviction policy and complex addressing function related stuff. Such as finding eviction sets and priming the cache. There are many suggestions for mitigation – not sharing memory being one that’s very good but incomplete as it doesn’t work for evict+time and prime+probe. There are different ways of adding noise or making the timer less accurate – all in my opinion fundamentally flawed as noise or less timer accuracy can be defeated by law of large numbers. The Cache Allocation Technology can be used to protect against CSC’s – here I’m not a fan because it requires software to be reauthored to be protected and that it has limited coverage. Finally, it’s worth mentioning writing branch-free-cache-set-aware code which actually works, but is difficult and we’re unlikely to be able to demand that all affected software does so.