Recently a number of different micro architecture attacks on Kernel Address Space Layout Randomization(KASLR) have been published. The idea behind KASLRis that code reuse attacks and read/write primitives are useless if the attacker is unable to tell where what code/data is. Consequently modern operating system randomizes the virtual addresses where code/data is stored. However it turns out the CPU leaks information about the address space layout causing failure in the defense. This blog post shortly summarizes the 4 known micro architecture attacks on KASRL and discusses potential mitigations. Part of the mitigation portion will be speculative. Bias up front: I’m a coauthor of the prefetch attack and quite like it.
What Do I mean by micro architecture KASLR break
When I wrote my first blog post  on breaking KASLR with micro architecture, I got criticized for using the term “micro architecture”. What I mean to convey when I use this term is that there is nothing in the Instruction Set Architecture (ISA) that specifies that the x86 has to work in ways that allows us to attack KASLR. Rather the features that are being used, are part of the implementation of the ISA. This usage of the term seems in line with common usage in the academic literature. There have been attacks on KASLR using page deduplication e.g. Barresi et al.  which some consider a microarchitecture attack. I however, do not and will thus not include these attacks in this discussion.
History of micro architecture KASLR breaks
Using page faults
Hund, Willems & Holz  was the first to publish an attack on KASLR. They use a combination of a classical prime+probe cache attack and a novel attack on the paging caches the time it takes for a fault to be thrown on unprivileged access to a page on x86-64 differs significantly depending on weather the translation is cached in the paging caches. An attacker can measure this difference. Further even an unprivileged access to a page causes the translation caches to be loaded and consequently you can use a timing difference attack to tell if memory is mapped on a certain location. They called this the “double page fault” attack.
I wrote a blog post in January  that become my starting for the paper by Gruss et al . The Hund, Willems & Holz  attack suffers from the drawback that the timing is measure across the page fault handler which adds significant noise. This makes the attack itself slow and due to noise the attack needs to be repeated a significant amount of times to become reliable. What I and my coauthors noted was that the prefetch instructions do not cause page faults – they terminate silently. They do however have the same timing properties, as they do the same look ups. Consequently replacing page faults with the prefetch instruction provides a faster and more reliable KASLRbreak.
Wojtczuk wrote in a blog post  that intel TSX technology might make Hund, Willems & Holz’s  attack significantly more practical. Jang et al.[ 2] took this up and showed Wojtczuk  was right. TSX works by being able to roll back a set of memory accesses, if they cannot be completed immediately due to conflicting access in the relevant portions of the memory sub system – much like transactional access to databases. However, access to a page that would normally trigger a page fault inside a transaction now just causes a rollback. Consequently, a significant speed up and reliability gain can be had by wrapping your “probe” in a TSX section.
Using Branch Target Buffer
The fourth and fairly different attack came using the Branch Target Buffer, BTB. Evtyushkin, Ponomarev & Abu-Ghazaleh  found out that intel caches branch targets in order to better branch prediction in BTB. Correctly predicting branches is important to performance because a failure to predict correctly will under utilize the entire instruction pipeline. Intel however is cheap with the bits and uses only the lower 31 bits to store a branch target in the cache. Since KASLRis typically build on entropy in the lower 31 bits and these bits are shared between kernel and user mode, an attacker can cause conflicts. These conflicts will cause measurable timing differences.
Despite the first three methods evolving around the same theme they do have significantly different advantages and caveats. The unique BTB method obviously also have diverging properties. I shall stress the advantages and problems of each vector using Hund, Willems & Holz paper  as a baseline. This classical method now appears mostly obsolete. That said I consider the paper a milestone for many reasons and it’s well worth a read.
The Advantages of the prefetch vector
Using the prefetch instruction is, as noted above, faster and more reliable. Not only does it make it more reliable, but also more accurate. Without the noise one can spot a timing difference depending on the level in the page table hierarchy an abort was caused. This allows an attacker to efficiently map out the entire address space top down starting out with 1 gigabyte regions. Further the prefetch instruction actually ignores the page table security settings and prefetches an address into a cache. This allows an attacker to recover physical addresses and by pass SMEP & SMAP when the operating system is using a direct mapping. The prefetch instructions are available on all current CPU’s, as they were introduced back in 1998.
The TSX vectors’s advantages
Like the prefetch attack, no fault handler is called and consequently the attack has significantly reduced the noise. The TSX attack, unlike the prefetch, can trigger code access through execution using a jump or call instruction. This has significantly different timing properties and consequently allows the attacker to distinguish between code and data. As a downside it cannot use the cache loading properties of the prefetch instruction. A major drawback of the TSX vector is the installed base of TSX which is somewhat limited. It was officially introduced with Haswell micro architecture. To this day TSX is only available in high end computers: some I5, I7 and xeon’s. This does not yet include the U series which is the most popular processor for notebooks.
The BTB attack is probably the most limited attack, as it is only able to locate code and data where branches give away the address. For all its limits in this regard it is probably the fastest of the attacks. Further Felix Wilhelm  showed that since the attack does not rely on page tables, it translates to host-KASLR breaks too.
Intel has in the past not been prone to do anything about micro architectural attacks and consequently I have significant doubt that they will with KASLR attacks. In fact the segment limits that were part of the x86-32 architecture would be a shop stopper for the first three attack vectors, but was removed for performance with x86-64.
It is often suggested to make the timer (RDTSC(p) ) less accurate will mitigate micro architectural attacks which covers all 4 attacks vectors here. The cost is obviously some implementation overhead and a less accurate timer for benign purposes. It could be implemented through the CR4.TCD bit for bare metal operating systems or setting up the MSR’s to get a VMEXIT in hypervisor environments. These cost seems reasonable to me – I’ve yet to hear a convincing argument why user mode applications in general, need such timing accuracy and I never heard a convincing argument that any application would be significantly slowed down by the overhead required to get less accuracy on RDTSC(p) instruction. That said lower timing accuracy is not a good mitigation in cases where an attack can be repeated. The reason is that once an attacker can repeat a timing attack, he can use the law of large numbers to cancel out any noise or bias added to the timer. For KASLR attacks, there is little reason why an attacker cannot repeat the attack and in fact the Hund, Willems & Holz  attack does this to great effect, not because of a noisy timer, but because a the noise added by the page fault handler. I thus think it’s safe to conclude that a noisy timer cannot be a solution to KASLR micro architecture attacks.
The Hund, Willems & Holz  attack is fairly easy to mitigate. Since they actually trigger a page fault, one has a nice call back in the page fault handler. A few lines of code can use client CS and error code on the stack and the linear address in CR2 to check if a user mode program tried to access kernel memory and respond accordingly. As this event should be really rare, the response can be anything from causing a large delay making the attack impractical or shutting down the application.
The in Jang et al.  version of the TSX attack ,DrK, use page mapping (number of continuously mapped pages) and page attributes to identify modules. This can easily be thwarted by mapping unused pages in the driver region to a single PFN with only null data. The cost shouldn’t be too bad as only one PFN needs to be referenced mapped from a number of PTE which to some extend can use existing PDE’s. Exactly how many additional PDE’s would be needed to thwart the DrK would be a tuning parameter. This method would also slow down the prefetch attack, however currently it uses a function call to load the caches and thus cannot be completely thwarted in this fashion. It is likely that the TSX vector could be made to work in a similar fashion.
Both the TSX vector and the BTB vector is vulnerable to performance counter detection as there are currently performance counters that maps to micro events caused by the attack. Say for instance the detection I suggested for the Flush+Flush  attack could easily be modified to detect both these attacks. The prefetch attack however, does not seem to have a performance counter event that correlate sufficiently with the attack. The performance counters on software prefetch’s was discontinued and remapped with the by now very old Silvermont generation of processors. These kinds of detections are unfortunately either high-overhead or false positive prone. Consequently they don’t off themselves as general solutions, but rather of a close gap nature. Without a good performance counter mapping to the prefetch instruction, reliable detection would likely be prohibitively expensive performance wise using these methods.
The BTB and TSX vectors offer other ways of detection. In particular it seems likely that both will generate activity in the Last Branch Record (LBR) stack. If one were to instrument the RDTSC(p) instruction, as suggested above, the distance between the last RDTSC(p) instruction and the current combined with the LBR would provide clues to nefarious activity quite possibly at a fairly low cost.
More promising is a BTB only trick that relies on the very thing that makes the BTB attack work in the first place. The BTB vector works because only the lower bits are used to store the branch target and thus you get conflicts between kernel and user mode branches. Interestingly this conflict consequently only gives you information on the lower 31 bits and Evtyushkin and Ponomarev  notes that using the upper bits for KASLR thus remains a solution. Using all the remaining addressing bits f or KASLR seems a good idea, however all modern operating system sorts different types of code and data into blocks and partically uses these blocks to identify maglinant behavior. For instance drivers in Windows are loaded between 0xFFFFF800’000000000 and 0xFFFFF8FF’FFFFFFFF. Thus to minimize changes to this general structure using only the upper 9 bits (the PML4 bits) would probably be the easiest solution. This would give you 256 different locations which the BTB method could not tell apart (the top most bit is used to distinguish between kernel mode and user mode). Not much, but an attacker is unlikely to get away with causing a machine a 100 times to get it right be random. In fact Windows 10 Anniversary Edition does this for PTE’s, but not for code. See Economou  for more info.
The three vectors based on actually accessing kernel memory can be mitigated through address space separation. By treating kernel mode as a separate “process” and swapping the CR3 register immediately after/before a ring change, one can effectively block these attacks. Depending on the amount of ring switching going on, an operating system are looking at about 5% over head as switching the CR3 registers are expensive in and of itself and because it invalidates the paging caches. Both Yang et al  and Gruss et al.  mentions this method.
A mitigation that I considered for thwarting the BTB attack was flushing the BTB itself, by systematically doing a number of jmp instructions on switching back into user mode. This method is likely to have too much performance overhead to be practical. The reason for this method is the size of the BTB, which Godbolt  estimates the BTB in his Haswell to be 2048 entries large – thus requiring 2048 dummy branch instructions on every transition into user mode.
The natural question, which of these methods is the best, is a misunderstood question. While the first method appears obsolete by now, the other 3 methods have different advantages and disadvantages. They base themselves on at least 2, but probably at least 3 different side channels in the CPU and thwarting all four attacks seems like a difficult undertaking. Personally I find it likely that there are more leaking in the processor to be found – maybe I’ll write about that another time. The natural conclusion is that KASLRfor bare metal operating systems seems to be broken for defending against privilege escalation mounted from native code. The caveat here is the “from native code part”. Classical data based attacks, such as a font attacks of the past cannot be doctored to break KASLRusing these micro architectural attacks. Before I cry any crocodile tears ,I should note that KASLR was always a stop gap solution and consequently mitigations become mitigations for a mitigation. The real problem being bugs in kernel mode – though very difficult to fix all kernel mode bugs, minimizing code in the kernel and improving kernel mode code quality must remain the focus of making the kernel secure.
A Barresi , K Razavi, M Payer, T Gross: “CAIN: Silently Breaking ASLR in the Cloud“ – https://nebelwelt.net/publications/files/15WOOT.pdf
 Y Jang, S Lee, T Kim , “Breaking Kernel Address Space Layout Randomization with Intel TSX”
– Proceedings of the 23rd ACM Conference on Computer and Communications Security
 R Wojtczuk.” TSX Improves Timing Attacks Against KASLR.” https://labs.bromium.com/2014/10/
 M Godbolt. “The BTB in contemporary Intel chips.” http://xania.org/201602/bpu-part-three
 Evtyushkin D, Ponomarev D, Abu-Ghazaleh N. Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR. InProceedings of 49th International Symposium on Microarchitecture (MICRO) 2016.
 D Gruss, C Maurice, A Fogh, M Lipp, S Mangard . “Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR.” Proceedings of the 23rd ACM Conference on Computer and Communications security 2016
 A Fogh. “Detecting stealth mode cache attacks: Flush+Flush. http://dreamsofastone.blogspot.de/2015/11/detecting-stealth-mode-cache-attacks.html
 A Fogh: “Breaking KASRL with micro architecture Part 1”, http://dreamsofastone.blogspot.de/2016/02/breaking-kasrl-with-micro-architecture.html
 R Hund, C Willems, T Holz. „Practical timing side channel attacks against kernel space ASLR” Security and Privacy (SP), 2013 IEEE Symposium on, 191-205
 N Economou. “Getting Physical: Extreme abuse of Intel based Paging Systems” -https://blog.coresecurity.com/2016/08/25/getting-physical-extreme-abuse-of-intel-based-paging-systems-part-3-windows-hals-heap/
 F Wilhelm “mario_baslr” https://github.com/felixwilhelm/mario_baslr/