In debt to Retpoline

 

Appendix was added on the 14th of Febuary 2018, in response to comments made to me on twitter. In this connection “retpoline pause lfence” and “retpoline ud2” was added to the table. Other than that only typos where fixed since original post.

Abstract

In this blog post I explore the Retpoline mitigation that Paul Turner of Google suggested for the Spectre indirect branch variant issue [1]. A short differential side channel analysis is made along with a performance analysis. The impact of the use of a pause instruction in retpoline is discussed. Finally I consider the technical debt leveraged on CPU developers of a widely deployed retpoline.

 

How does retpoline work

This section follows the google presentation of retpoline closely [1]. I included it because it provides the context for the remainder of this blog post. A jmp rax is turned into the following code:

 

  1. call set_up_target;  
  2. capture_spec:     
  3. pause;                  
  4. jmp capture_spec
  5. set_up_target:  
  6. mov [rsp],rax   
  7. ret;          

call set_up_target (1) pushes the address of capture_spec(2) to the return stack buffer (RSB, cpu internal buffer used to predict returns). It also pushes the address of capture_spec(2) to the program stack and transfers control to set_up_target (5).  The mov [rsp],rax overwrites the return stack address on the program stack, so that any subsequent returns will pop the stack and return to the original target in its architectural path. The speculative path of the return remains the value from the RSB and thus the CPU will speculatively execute the code starting at (2). The pause instruction is meant to relinquish pipeline resources to co-located hyperthreads and save power if no co-located hyperthreads are present. The jmp in line 4 continues to repeat the pause until the speculative execution is rolled back when the ret in line 7 finished executing.

 

In simple terms this means the speculative path will resolve to a spinlock thus not leak any information through a side channel. The architectural path will eventually resolve correctly and the program will run as it is supposed to.

 

Side channel analysis of retpoline

Retpoline’s architectural side channels consist of flushing an entry in the RSB caused by the call in line (1). This information is unlikely to bring an attacker much advantage and would require an attack to have a sufficient amount of control of the RSB. The 6th line makes a store access on the stack. This is visible through a cache side channel to an attacker — provided the attack has sufficient control. However, it is unlikely to provide much information given that the stack is usually used by the functions themselves. It is worth noting that any information this side channel can provide is also provided by the jmp rax (6). Presumably retpoline’s stack access provides slightly less information to an attacker than the original jmp rax instruction – i.e. BTB indexing bits vs. cache set indexing bits. The 7th line is more tricky: if the CPU updates the BTB after a misprediction, the BTB side channel will be similar useful to the original jmp rax. I think it is likely that the BTB is not updated until retirement, so that this side channel isn’t present. Since the pause instruction (line 3) is a CPU hint, the CPU may choose to take or ignore the hint at its discretion – more on this below. Thus, pause may or may not provide a side channel.  If external power is plugged in and the hint is taken one can see a co-located hyperthread speed up (see [2]), which is the purpose of having the pause instruction here, but certainly a side channel as well. If the CPU is running in power saving mode (e.g. unplugged laptop), pausing provides a side channel since executing a pause instruction on two co-located hardware threads causes a delay for both, presumably through C-State interaction. See[3]. In sum, I think the side channels provided by retpoline is less valuable than the side channel provided by the original indirect branch (but is still present).

 

Performance analysis of retpoline

A jmp rax instruction takes about 4 clock cycles to execute and predicts correctly very often. It is replaced in retpoline with a ret instruction which will always mispredict and thus executes far slower. However, unlike doing hard serialization once the ret instruction has been executed out-of-order, the CPU seems to be able to continue without a pipeline flush and thus allowing out-of-order execution to continue. This creates two corner cases: firstly, where dependencies block out-of-order execution across the indirect branch, and secondly, when there is no dependence across the indirect branch. I managed to create a sequence of instructions where retpoline was just as fast as the original unpredicted branch – this is the case when the instructions after the branch depend on the results of the instructions before the branch. Obviously, any co-located hyperthread will be more affected by the retpoline than a predicted indirect branch, but the thread itself does not lose any cycles. At first this seems weird, but imagine completely dependant ALU integer instructions on both sides of the indirect branch – with the branch unit being completely free, both retpoline and the indirect branch will execute concurrently with the integer ALU instructions before the branch. Since the integer ALU instructions after the branch are dependant of those before the branch, they only get scheduled for execution once all prior instructions have been executed. Thus, retpoline and an indirect branch perform equally. In the case of no dependencies across the indirect branch, retpoline is slower. Retpoline will not allow the out-of-order execution to continue until the ret instruction is executed and consequently adding a penalty compared to a predicted indirect branch. In general, the longer it takes for the indirect branch to resolve, the higher the penalty of retpoline is. There is also an indirect performance cost of retpoline which – in my opinion – is likely to be somewhat smaller. The call in line 1 will push a return address onto the RSB (and consequently may evict the oldest entry in the RSB), and thus potentially causing an RSB underflow once a previous call returns. A RSB underflow will manifest itself as a negative performance impact if the evicted RSB entry causes mispredictions or stalls in unrelated code later on. For this to happen a call stack is required to be deeper than the the size of the RSB. The stall penalty of the underflow was big enough to cause Intel to add prediction to the microarchitecture (for Broadwell and Skylake). If this prediction was as efficient as using the RSB, the RSB would not exist.

 

The following table presents the results of my micro benchmarking:

 

Unit Clock cycles Mean Std.dev Median
jmp rsi 350.33 95.48 346
retpoline pause 410.64 65,94 410
retpoline lfence 403.76 25,96 404
retpoline clean 402.13 19,99 402
retpoline pause lfence 406.74 64,76 404
retpoline ud2 404,29 28,70 402

The “retpoline clean” is without the pause instruction, “retpoline lfence” is where pause has been replaced with a lfence instruction. The results are generally not very stable over multiple runs but with all three versions of retpoline often ending up with an average of around 400-410 cycles and the indirect branch being around 50 clks faster on average. Thus, some additional care should be taken before concluding

 that “retpoline pause” is slower than the others. I ran the tests with 100k observations and removed the slowest 10% observations in a primitive noise reduction approach. The microbenchmark is for a bad case with no dependency across the indirect branch for integer instructions (add rsi,1, add rbx, 1 respectively) and is run on a Intel  i7-6700k. While microbenchmarking is important for the arguments I will put forth in this text, it is important to note that they are not reflective on the system’s performance as the incidence of indirect branches is relatively small and this benchmark is manufactured to portrait cases which are worse than in normal scenarios. Also, the microbenchmarking completely ignores any indirect effects.

The weird case of pause

It immediately seemed weird to me to have the pause instruction in the spinlock in line 3 of retpoline. Usually we have pause instructions in spinlocks, but spinlocks execute architecturally at some point. Having a pause of 10 clock cycles for Broadwell and 100 for Skylake in the spinlock potentially causes the CPU to pause the architectural flow that needs to be done before the return instruction can be executed. This may lead to a larger-than-necessary penalty to the spinlock. However, pause is not actually an instruction. It is a hint to the CPU and my guess is that the hint is not taken. I ran retpoline on my Broadwell and my Skylake and compared the penalty: there was almost no difference. This is important because we would expect the different implementation of pause to give different average latencies if it was actually executed (10 clk vs 100 clk). Another argument for pause not being executed speculatively is that pausing is connected to a VmExit. I can only guess why Intel  made it is possible to get a VmExit on pause instruction. I think the most compelling reason would be to use the pause used by spinlock in a guest to process small work items in the hypervisor instead of just idling the CPU. This would probably also help virtualizing hardware. If I am right about this, it would be sensible for the pause hint actually pause only on retirement instead of pausing the CPU speculatively. Another argument is power management: the behavior of the pause instruction depends on the C-State of a co-located hyperthread. Presumably this gives us one of the two side channels as described in a previous section. There is little reason on why a CPU designer would pause a thread which is executing other instructions out-of-order.

 

Discussion of technical debt of retpoline

As clever as retpoline seems, I think it is fundamentally broken. Not because it does not work but instead because it builds a large amount of technical debt. Adding retpoline to a piece of software would require CPU designers to make sure that legacy software is compatible with new CPUs. If a company like Google applies retpoline in their instructure it is fairly unproblematic, Google has a nice inventory of software running on their systems and they can make sure that software applied to new CPUs is recompiled and consequently this poses no constraints on a CPU designer. However, if we add retpoline to a compiler we can be sure it will be added to all kinds of software including virtual machines, containers, specialized software etc. These pieces of software often do not remain supported, they are poorly catalogued and consequently, if the behavior of retpoline changes significantly, these systems may perform suboptimal, be unsafe or even completely break. This effectively ties the hands of CPU designers as they strive to improve the CPU in the future.

 

The first concrete problem I see is the use of the pause hint: regardless of whether the pause hint is taken in retpoline or not, its non-intended use in retpoline ties the behavior of this instruction down for CPU designers in future generations of CPUs. It is worth noting here that we have at least 3 different implementations of the pause hint in different CPUs already (treated as nop, stall 10 clks, stall 100 clks)[5]. Adding to that the complexity of the instruction I outlined above, it would be fair to assume that this instruction might need changes in future generations of CPUs. Thus, having the pause hint in retpoline in scattered over software everywhere might turn out to be a bad idea thinking long term. The good news here is that there probably is a good solution for this. Replacing the pause instruction with lfence will serialize the speculative path and probably even stop it from looping; effectively stopping the execution of the spinlock may free up resources for co-located hyperthread as well as branch execution units for the main thread that otherwise would have been tied up by the jmp instruction in line 4. I ran some test on Skylake and found very near identical performance results for pause and lfence, suggesting that this is viable solution. It is important to note that the lfence instruction was previously documented to only serialize loads. But instead it silently serialized all instructions — which is now the documented behavior since Intel published their errata documents for the Spectre/Meltdown patches [7]. So, the mortgage on lfence is small and has already been signed.

 

The second concrete problem I see is the construct which repoline uses to direct speculative execution into the spinlock is a much bigger problem. On Skylake, return instructions predict by using the indirect branch predictor, requiring Skylake to be handled differently than other CPUs. The problem is twofold: on the one hand, the very common ret instruction needs to be replaced with retpoline (or other non-speculative branching), and on the other hand, the hardware interrupt raised in line 6 may underflow the RSB in the interrupt handler. Thus, repoline is already potentially unsafe on some CPUs (in my opinion much less than without retpolinie though). The technical debt here is that retpoline may be completely unsafe if future CPUs stop relying on the RSB for return prediction. There may be many reasons for a CPU designer to change this. For example a completely unified system for indirect branches or prediction of monotonic returns (returns with only one return address). The latter will keep the RSB save from non-monotonic returns if the RSB underflows and thus may perform better when there are deep call stacks. I do not know whether these things are good ideas, but retpoline might effectively rule them out. Also, one could imagine conflicts with future CPU-based control-flow integrity systems, etc.

 

One could argue that performance optimization rely on microarchitecture details all the time, but there is an important difference between breaking a performance optimization and a security patch. Already we see problems with updating broken libraries amongst software vendors, it’s not difficult to imagine what would happen if a secure software becomes insecure because of CPU evolution.

 

Instead of taking on technical debt potentially forever, I suggest we use either a less secure option (i.e. lfence ahead of indirect branches) or a more expensive option such as IPBP [7] or replacing indirect calls with iret (which is documented to be serializing). That constitutes a high price now, but avoids paying rent on technical debts in perpetuity.

 

Conclusion

Retpoline is an effective mitigation for Spectre variants which rely on causing misprediction on indirect branches on some CPUs. From a pure side channel perspective retpoline adds a different side channel but it is an improvement over the side channel of a traditional indirect branch nonetheless. Retpoline’s performance penalty is complex, but likely smaller than the penalty of the serializing alternatives. However, as Retpoline relies on assumptions about the underlying microarchitecture, it adds technical debt if used widely. If Google, Microsoft or whoever with software-deployment management wants to use it, they have my blessing but there is reasons for scepticism if it is a good idea to have it in general purpose compilers. CPU Vendor’s short term marketing deficits should not lead us to trade small short term performance gains causing technical debt to be paid back with interest in the future for a more complex microarchitecture.

Appendix added 14th of Febuary 2018

Stephen Checkoway of University of Illinois at Chicago commented that it might be worth testing an ud2 instruction in the speculative executed spinlock. I find this idea promising because ud2 essentially just throws an “invalid opcode” exception and thus abusing it for retpoline is likely to produce equivalent performance to that of lfence, perhaps even better, Further, it’s unlikely that using this instruction is associated with any technical debt. The simple test result here shows that the direct performance impact is approximately similar (see table above) that of the other versions of the spinlock.

Some implemented versions of retpoline uses a pause followed by an lfence instruction. I added this to the performance table as well. Thanks to Khun Selom & @ed_maste (twitter account).

 

 

Literature

[1] Turner, Paul. “Retpoline”. https://support.google.com/faqs/answer/7625886

[2] Fogh, Anders. “Two covert channels”. https://cyber.wtf/2016/08/01/two-covert-channels/

[3] Fogh, Anders “Covert Shotgun”. https://cyber.wtf/2016/09/27/covert-shotgun/

[4]Intel, “Intel® 64 and IA-32 Architectures Software Developer Manual: Vol 3”

.https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

[5] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. July 2017. https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[6] Evtyushkin, Dmitry, Dmitry Ponomarev, and Nael Abu-Ghazaleh. “Jump over ASLR: Attacking branch predictors to bypass ASLR.” Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016.

[7] Intel, “Speculative Execution and Indirect Branch Prediction Side Channel Analysis Method”, https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr

Behind the scenes of a bug collision

Introduction

In this blog post I’ll speculate as to how we ended up with multiple researchers arriving at the same vulnerabilities in modern CPU’s concurrently. The conclusion is that the bug was ripe because of a years long build up of knowledge about CPU security, carried out by many research groups. I’ll also detail the rough story behind the research that let me to the bug. My story is probably different than that of the other researchers, but while unique, I am relatively sure that it’s the same for all researchers on most security issues: security research is a long haul thing. The remainder of this blog post is semi-technical.

Why did we get a bug collision on Spectre/Meltdown?

This is of course my take on the event, my personal story, which I’ll detail below. Research collision in CPU research isn’t that uncommon. In fact, the story of my friendship with Daniel Gruss is about a series of collisions. In 2015 I was preparing a talk about row hammer for Black Hat with Nishat Herath [2] when Daniel tweeted that he was able to flip bits from Javascript. I didn’t want to have questions I couldn’t answer, so I started researching it and literally the evening before Daniel published how he was doing it, I knew how he did it. Later Daniel teased me about detecting cache side channels if there were no L3 cache misses. I replied ‘are you timing Clflush?’ He was indeed. You’ll find me being acknowledged in the paper for this reason [1]. I told him he shouldn’t worry about me competing on publishing it, because I was doing research on a side channel in the row buffer and didn’t have time to compete on Clflush. Turns out he and the wu cache clan were too. I blogged it, and wu cache clan wrote a paper on it in Pessl et al. [3]. You’ll find me acknowledged here as well. Not long after that, I did a blog post on breaking KASLR with the prefetch instruction. Obviously, Daniel was doing the exact same thing again. We had enough of competition and started my by now a regular collaboration with the WU cache clan after that point.

So why do things like this happen? (Granted, the story about Daniel is a freaky one.) Well, CPU research is much like drawing a map of an uncharted world. Researchers start from known research and proceed into the unknown, and if they find something, they document it and add it to the map. This essentially means that the frontier looks very similar to everybody leading people into the same paths. This processed is very much sustained by the fact that almost all research in this area is academic and academia is much better organized in terms of recording and documenting than hackers.

For a thing like meltdown, the real foundation was laid with the work on cache side channels sometime back around 2005. There are many papers from this time, I’ll mention Percival [4] because it’s my favorite. Another milestone paper was Yuval Yarom’s paper on Flush+reload [5]. Note that Yuval is also partial to the Spectre paper. With this foundation, a subgenre of papers emerged in 2013 with Hund, Willems & Holz [6]. They essentially noticed that when an unprivileged user tries to read kernel mode memory, the CPU actually does a great part of the read process before making an error, allowing a user to observe not the data of the kernel, but the layout of the kernel – this is known as a KASLR break and is important for classical exploits. This work was followed up with improvements including Gruss et al [7] and Yang et al [8]. Both papers showed that a lot more work was being done by the CPU than was strictly needed when an unprivileged user accesses kernel memory – an important prerequisite for Meltdown to exist. Also in 2016, another KASLR break using branches from Evtuishkin et al. emerged[9]. They didn’t try to read from the kernel but rather found that branch prediction leaked information. This is important because branch prediction is a precursor for speculative execution and thus build a bridge towards Meltdown. Felix Wilhelm [11] extended Evtuishkin et al. [9] to extend to hypervisors which are likely to be important in the rationale for hypervisors being affected. In fact, I think Jann Horn mentioned this blog post as an inspiration. There where other works, like that of Enrique Nissim[11] which showed that the KASLR breaks real-world applications with a classic exploit. Also in 2016, some work was being done on side channels in the pipeline. My Covert Shotgun blog post is an example of this literature[12]. To sum it up, by the end of 2016 it was known that unprivileged reads from user mode to kernel mode did more processing than was strictly required, it was known that branches were important, and there was work going on examining the pipeline. In end effect, the Meltdown bug was surrounded on all sides. It was a single blind spot on the map – obviously, nobody knew that there’d actually be a bug in this blind spot and I think most did not believe such as bug existed. Other people would probably pick other papers and there where many. My point remains: a lot of people moved towards this find over a very long period of time.

The personal perspective
I work for GDATA Advanced Analytics and there I work in an environment that is very friendly and supportive of research. Without research, a consultant company like ours cannot provide excellence for their customers. However, CPU bugs don’t pay the bills, so my day job does not have much to do with this. I spend my time helping customers with their security problems: malware analysis, digital forensics, and incident response being the main tasks that I do on a day to day basis. This means that the bulk of the work I do on CPUs is done in my spare time after hours.

So my story with Meltdown and Spectre starts essentially at Black Hat 2016 where Daniel Gruss and I presented our work on the prefetch instruction. The video of our talk can be found here [13]. As described above, this work slowly gave me this fuzzy feeling that maybe this work is just the tip of the iceberg. In fact did a blog post about the meta’s on breaking KASLR in October 2016 [14] and presented it at RuhrSec 2017 [15], so I was very well acquainted with that literature. Meanwhile, I wanted to get away from working with caches and accidentally found some covert channels and wrote about them[16]. I thought there might be more and started picking apart the pipeline in search of covert channels. My frustration led me to automate finding covert channels which resulted in this blog post [12] and later this talk at HackPra [17]. In the talk, you can hear that I’m still frustrated “I don’t care, I’ll just use a shotgun”. To this day I have many unanswered questions about the pipeline..
Later that year I met up with the WU Cache Clan at CCS where we were presenting the academic version of the prefetch KASLR break[7]. We had some beer-fueled conversations. The conversations were added upon at Black Hat Europe (Michael Schwarz and I did a talk on the row buffer side channel we’d done concurrent work on) where I started believing that meltdown might be possible, but yet without a clue as to how it could possibly work.

I finally made the connection to speculative execution on December 2016/January 2017 when I prepared the presentation for HackPra about Covert Shotgun[17]. There are a lot of slides about speculative execution. At first, I didn’t think about reading kernel memory. My first “attack” was just another attack on KASLR. Essentially it was Hund, Willems, Holz in a speculative version- the work never really got finished (timing inside of speculative execution is possible, but not easy – I did not find a solution to this problem until much later) the rationale for doing this work is that it would solve a problem with their method and do this stuff in my spare time – fun is essential and this sounded like fun. So my project name for Meltdown was “undead KASLR”, despite me quickly figuring out there was bigger fish to be fried. I told my friend Halvar Flake about the weird ideas while presenting at IT-Defense in February and his encouragement was a big part of me actually continuing, because I didn’t believe it would work.

In March, I had the first chance to do some real work on the project during a small get away from work to present Jacob Torrey’s and my work on PUFs (the work was really mostly Jacobs – and not related to Meltdown per se) at Troopers 17. I researched mornings in the hotel and even did some research during other people’s talks. Fun fact: there is s a video of me doing a spectre-style attack POC on the code I’d added to Alex Ionescu’s wonderful Simplevisor. Hi Alex! I’m the bald head seen to the lower left of the center isle doing weird head movements and packing away my laptop as I succeeded at around 19 minutes into the video. The second half of the talk was pretty awesome btw!!

Later in March 2017, I visited Daniel, Michael, and Clementine Maurice in Graz to work on a common project we had on detecting double fetch bugs with a cache attack (which became this paper [21]. Here I tried to pitch my idea, because with the workload I had I knew it would be difficult for me to realize alone. Unfortunately, I wasn’t the only one fully booked out and Daniel, Michael and myself were super skeptical at that time, despite the slight encouragement I’d had at Troopers. So we decided to finish the stuff we were already doing first. Might I add here that Daniel and Michael did some really cool stuff since then? Struggling with work and making a sufficient contribution to the double fetch paper the project was on ice. The main reason why my name is so far back on the paper is that I didn’t have time to pull my weight.

After Troopers, I didn’t get much done and was really frustrated about it. So in July, I started back up again in the evenings. It helped me immensely that my climbing partner suffered an injury giving me a bit more time. This time around I was working targeted at Meltdown. I were doing stuff much too complicated at first and wasted a lot of time on that. Then I tried to simple things up as much as I could and this is why I ended up with a negative result. I wrote up the blog on company time on a Friday before noon before leaving early on a vacation. Luca Ebach helped proof read it. If he hadn’t it would’ve been unreadable and Tomasulo would’ve been spelled wrong.. The “Pandoras box” part of the blog post is a reference to the limited and unfinished stuff I did on “Spectre”, which I was sure would work at the time, but needed checking before I’d commit to blogging about it. While on vacation, I decided to wrap things up in an academic paper and with some positive results in my hands, seek help from some academic professionals. So I have continued my research afterward, after all you don’t open Pandora’s box without looking what is inside. In light of recent events, I shall not be publishing the rest of the stuff I did.

The stuff that Jann Horn did, is really really awesome, the same goes for the Spectre/Meltdown papers. It is wonderful to see that I was barking up the right tree. It is important to me to mention that Jann Horn reported his research prior to my blog post and did not have access to mine prior to that date.

Literature

[1] Gruss, Daniel, et al. “Flush+ Flush: a fast and stealthy cache attack.” Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2016. 279-299.
[2] Nishat Herath, Anders Fogh. “These Are Not Your Grand Daddy’s CPU Performance Counters” Black Hat 2015, https://www.youtube.com/watch?v=dfIoKgw65I0
[3] Pessl, Peter, et al. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.” USENIX Security Symposium. 2016.
[4] Percival, Colin. “Cache missing for fun and profit.” (2005).

[5] Yarom, Yuval, and Katrina Falkner. “FLUSH+ RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack.” USENIX Security Symposium. 2014.
[6] Hund, Ralf, Carsten Willems, and Thorsten Holz. “Practical timing side-channel attacks against kernel space ASLR.” Security and Privacy (SP), 2013 IEEE Symposium on. IEEE, 2013.
[7] Gruss, Daniel, et al. “Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.
[8] Jang, Yeongjin, Sangho Lee, and Taesoo Kim. “Breaking kernel address space layout randomization with intel tsx.” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.
[9] Evtyushkin, Dmitry, Dmitry Ponomarev, and Nael Abu-Ghazaleh. “Jump over ASLR: Attacking branch predictors to bypass ASLR.” Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016.
[10] Wilhelm, Felx. “Mario Baslr”, https://github.com/felixwilhelm/mario_baslr
[11] Nissim, Enrique. “I Know Where Your Page Lives: De-randomizing the Windows 10 Kernel”. https://www.youtube.com/watch?v=WbAv2q9znok
[12] Fogh, Anders. “Covert Shotgun”, https://cyber.wtf/2016/09/27/covert-shotgun/
[13] Fogh, Anders, Gruss, Daniel. “Using Undocumented CPU Behavior to See Into Kernel Mode and Break KASLR in the Process” Black Hat 2016.
[14] Fogh, Anders, “Micro architecture attacks on KASLR”,https://cyber.wtf/?s=kaslr
[15] Fogh, Anders. “Micro architecture attacks on KASLR and More”, https://www.youtube.com/watch?v=LyiB1jlUdN8
[16] Fogh, Anders, “Two covert channels”, https://cyber.wtf/2016/08/01/two-covert-channels/
[17[ Fogh, Anders, “Covert Shotgun”, https://www.youtube.com/watch?v=oVmPQCT5VkY&t=34s, Hack Pra 2017
[18] Schwarz, Michael, Fogh, Anders. “Drama: how your DRAM becomes a security problem”. Black Hat Europe 2016 https://www.youtube.com/watch?v=lSU6YzjIIiQ
[19] Ionescu, Alex, “SimpleVisor” https://github.com/ionescu 007/ “
[20] Neilson, Graeme, “Vox Ex Machina”, Troopers 17. https://www.youtube.com/watch?v=Xrlp_uNBlSs&t=1145s
[21] Schwarz, Michael, et al. “Automated Detection, Exploitation, and Elimination of Double-Fetch Bugs using Modern CPU Features.” arXiv preprint arXiv:1711.01254 (2017).
[22] Fogh, Anders. “Negative result: reading kernel memory from user mode” https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/

Emotet drops ZeuS Panda targeting German and Austrian online banking users

Emotet is currently one of the prevalent threats on the Internet. The former banking trojan is now known to steal passwords and to drop other malware like Dridex on its infected machines. We recently found Emotet spreading Zeus Panda, which presented us with an opportunity to link some of our research on Emotet with our analysis of ZeuS Panda.  The Zeus Panda sample used in this wave is rolled out through Emotet in german-speaking countries and targets online banking users in Germany and Austria.

The Emotet C2 server drops additional malware to infected system. Whether a system receives such a package seems to be based on the geographical location of the infected system in question. After the additional malware is downloaded from the C2 server, it is written to a file in %ALLUSERSPROFILE% (C:\ProgramData in recent Windows versions) with a random name of 4 to 19 characters length and the file extension “.exe”. Emotet is capable of executing this binary in two different ways, either of which is chosen by the C2 server. The first mode executes the malware in the same context that Emotet is running in, the second mode executes the malware in the context of the currently logged-on user.

As stated above, the current wave downloads and executes the well-known ZeuS Panda banking trojan. To know which banking sites it should attack and how to modify the site’s content, the trojan needs so-called webinjects. From the URL masks of the webinjects this sample uses, we can tell that it currently targets online banking customers in Germany and Austria. All injects write a single script reference into the targeted websites. When the targeted site is loaded, the browser loads the referenced script, which is then executed in the context of the banking website. The only difference between the webinjects is the last number in the URL of the script source. This number seems to define the targeted website, which allows the server to deliver a target-specific script. The script actually downloaded is obfuscated by a simple string encryption. The actual script is part of an Automated Transfer System (ATS) which tries to persuade the user into transferring money to an account the attacker specifies.

scriptSchutzanalageRetoureThe above screenshots show an exemplary representation on the modification of the banking websites. They show two different attack scenarios: The first script tries to trick the user into performing an transaction in the guise of a security check. The attackers “inform” the customer of newly installed security measures on the banking website, coercing the user to complete a training using a demo account, before they are able to access their account again. During this training, a real transaction is made in the background to an account that the attacker specifies. The phrasing in the text is lousy and should raise suspicion with most customers.

 

The second script tries to persuade the user that an erroneous transfer was made to their account. It suggests to go to a bank branch or make the return transfer online. Additionally, the script blocks access to the banking account until the return transfer has been completed. The phrasing in the text is better than in the first script and may not raise suspicion at first glance.

The first script resembles word by word the webinject Kaspersky identified during their analysis of Emotet in 2015. At this time Emotet contained its own banking trojan capability and delivered the webinjects directly into the browser. As ZeuS Panda uses the same webinject format as the old Emotet, we can speculate about the reasons:

  • The webinject is acquired from the same creator
  • The group behind Emotet has dropped developing their own banking trojan and acquires such trojans from other malware authors
  • The group behind Emotet developed multiple banking trojans for its own use and for sale

It seems Emotet is not only used to sell distribution of malware, but also used by its owners. It is also possible that the group behind Emotet uses the slim downloader as an entry point for targeted attacks. In this case the group can spread Emotet worldwide and distribute specific malware to each target. As the real malicious payload is only downloaded after some time and only to specific targets, analysts can not directly draw conclusions on the real intention of an infection.

IOCs

Emotet:

C2:

5.9.195.154
45.73.17.164
60.32.214.242
85.25.33.71
194.88.246.242
213.192.1.170
217.13.106.16
217.13.106.246
217.13.106.249

SHA256:

0d25cde8d49e1bcf6a967c0df6ac76992ff129ea5c30a1492a5bedd313e6fb51
c287a9aa25ed6afc54bc5ebe4b098675f3fa4b7cb51fbdcfb50591b4b8fa3b90

ZeuS Panda:

C2:

uamanshe.gdn
ugjeptpyour.top

SHA256:

4fe20a9cf5e5c28ec55aa529179f7fe6df3cda8ae43340b04b2402f43dfefd5f
fbd9e31cc5cbfce2b8135234fdcfdac7fa48a127aa6f3644d05c6ba77bd6d903

Emotet harvests Microsoft Outlook

The original German blog post can be found on the G DATA Blog.

Emotet has been known as a trojan for years. Former versions focused on attacking online banking users, however the current Emotet was  transformed into a downloader and information stealer. The first reports of this new variant were published by CERT Polska in April 2017. Since then, Emotet has been spreading through spam phishing mails containing a link to a Microsoft Word document that acts as dropper for the Emotet binary.

Recently, CERT-Bund again warned about the spam mails which spread Emotet. The sender address of these emails is spoofed to appear as a sender known to the recipient. This strengthens the trust in the mail and increases the probability that the recipient opens the attachment or link without further consideration.

For this to work, the entities spreading Emotet need to have at least superficial knowledge of the social network a target interacts with via email. Acting opportunistically Emotet delivers a specific module to infected systems to harvest all emails in Microsoft Outlook accounts of the current user, allowing it to extract the relations between sender and receiver.

 

MAPI-Functions

To obtain the information from Outlook, the module takes advantage of the standardized interface MAPI. The picture above shows the loading of the MAPI-DLL and the retrieval of the needed functions. Utilizing this interface, the module iterates through all Outlook profiles it can access on the computer. It extracts all E-Mail-Account Names and E-Mail-Addresses from each profile. Afterwards it searches for emails recursively in each folder in the profile. From each mail found it extracts the sender (displayed name and mail address) and all recipients (displayed names and mail addresses) inclusive the recipients in the CC- and BCC-fields and saves them in relation to each other. The picture below shows the extracted fields from the emails. In case a field only contains a reference to an address book entry, the module extracts the name and email address from the address book. In this process only the mail header is evaluated, the content of the mails is not analyzed.

Fields

After the Emotet module has searched all profiles, folders, and emails, it writes the data it has retrieved in a temporary file in the directory %PROGRAMDATA%. The email addresses are sorted descending by how often they occur. Each address is extended with all contacts, that are in relation to it. However, two cases are distinguished:

  • if the referenced contact is the sender of the mail, it is extended with all recipients
  • if the referenced contact is the recipient of the mail, it is only extended with the sender

Example (Mailbox of A):mail

Mail 1: A sends to B and C

Mail 2: D sends to A

Mail 3: C sends to A , D, and E

A is referenced three times and therefore is placed on top of the list. A has a relation to B and C through mail 1, thus B and C get connected with A. Mail 2 shows a connection from D to A, thus D gets connected with A too. The relation from C to A in mail 3 is ignored, because it is already captured in mail 1 (A→C). Mail 3 contains the additional relations C→D and C→E. As no relations between C↔D and C↔E are already in the list, the contacts D and E get assigned to the contact C and are appended to the list.

The complete list, which gets transferred to the attacker, looks like this:

A<A@mail.com>; B<B@mail.com>; C<C@mail.net>; D<D@mail.com>
C<C@mail.net>; D<D@mail.com>; E<E@mail.com>

Afterwards the module encrypts the file, transfers it to the attacker and removes it from disk.

This  allows the attacker to get a condensed but comprehensive overview of the social network graph behind a victims email communications. With such a list, an attacker has knowledge of the relation between persons and can send spam mails with suitable sender header without great afford. Additionally, an attacker learns relations between contacts whose computers are not yet infected.

To deliver the spam mails to the suitable recipients, the attacker needs valid E-Mail accounts. For this task, they use an additional module that is able to extract the credentials from mail programs and transfer them to the attackers. To extract the credentials from all common mail programs, such as Microsoft Outlook, Mozilla Thunderbird, and Windows Mail, this module utilizes an integrated copy of the application Mail PassView from the company NirSoft. It writes this information to a temporary file, which is then encrypted and transfered to the attacker. Once transfered the temporary file is deleted.

DGA classification and detection for automated malware analysis

Introduction

Botnets are one of the biggest current threats for devices connected to the internet. Their methods to evade security actions are frequently improved. Most of the modern botnets use Domain Generation Algorithms (DGA) to generate and register many different domains for their Command-and-Control (C&C) server with the purpose to defend it from takeovers and blacklisting attempts.

To improve the automated analysis of DGA-based malware, we have developed an analysis system for detection and classification of DGA’s. In this blog post we will discuss and present several techniques of our developed DGA classifier.

The DGA detection can be useful to detect DGA-based malware. With the DGA classification it is also possible to see links between different malware samples of the same family. Such a classification is expressed with a description of the DGA as a regex.
Moreover, our analysis methods are based on the network traffic of single samples and not of a whole system or network, which is a difference to most of the related work.

DGA-based Botnets

A Domain Generation Algorithm (DGA) generates periodically a high number of pseudo-random domains that resolve to a C&C server of a botnet [H. 16]. The main reason of its usage by a botnet owner is that it highly complicates the process of a takeover by authorities (Sinkholing). In a typical infrastructure of a botnet that uses a static domain for the C&C server, authorities could take over the botnet with cooperation of the corresponding domain registrar by changing the settings of the static C&C domain (e.g. changing the DNS records).

infrastructure of typical botnets
Typical infrastructure of a botnet

With the usage of a DGA that is generating domains dynamically which resolve to the C&C server there is no effective sinkholing possible anymore. Since the bots use a new generated domain after every period to connect to the C&C server, it would be senseless to take control of a domain that is not used anymore by the bots to build up a connection to the C&C server.

infrastructure of typical dga botnets
Typical infrastructure of a DGA botnet

The C&C server and the bots use the same DGA with the same seed, so that they are able to generate the same set of domains. DGA’s use mostly the date as a seed to initialize the algorithm for domain generation. Hence the DGA creates a different set of domains everyday its run. To initialize a connection to the C&C server the bot needs to run first the DGA to generate a domain, that could be possibly also generated on the side of the C&C server, since both are using the same algorithm and seed [R. 13]. After every domain generation, the bot attempts to resolve the generated domain. These steps are repeated until the domain resolution succeeds, so that the bot figures out the current IP address of the corresponding C&C server. Through that DGA domain the bot can set up a connection to the C&C server.

Motivation

DGA detection can be very helpful to detect malware, because if it is possible to detect the usage of a DGA while analyzing the network traffic of a single sample, then it is very likely that the analyzed sample is malicious, since DGA’s are used commonly by malware but not by benign software. DGA classification is the next step in the analysis after a DGA has been detected. A successful classification returns a proper description of a DGA. With such a unified description, it is possible to group malware using the same DGA. Being able to group malware by correlating characteristics, leads to an improvement to the detection of new malware samples of these families. Therefore, the signatures of recently detected malware samples will be automatically blacklisted. The following figure shows for an non-DGA malware that grouping malware families based on the same domains in their DNS requests traffic will be only possible, if they use all the same and static C&C domain:

static domain
Two samples using the same static C&C domain

If the malware uses a DGA, then the grouping of malware will not be trivial anymore, because generated DGA domains are just used temporarily, thus using those to find links between samples would not be very effective.

domain mismatch
Two samples with the same DGA but a different seed

Note also that occurring domains in the network traffic of the recently analyzed malware sample could differ on another day with the same sample analyzed, since many DGA’s use the date as a seed. The solution is to calculate a seed-independent DGA description for every analyzed sample using a DGA. That description can be used then as a bridge between malware samples using the same DGA.

pattern descriptor
Pattern descriptor to abstract different DGA seeds

To solve this problem, we have divided it into three smaller tasks. Thus, the DGA classifier is structured into three components. Each component solves a task that contributes to the result of the DGA classification.

In this blog post we concentrate only on approaches for DGA detection and classification that are automatable, since we want to analyze a very high number of samples. Furthermore, we want to avoid as much as possible unnecessary network traffic, therefore we focus only on offline methods.

DGA Detection

This approach for DGA detection is based on statistical values calculated over the relevant label attributes of the domains. Since the domains generated by a DGA follow mostly a pattern, it is very useful to calculate the standard deviation of some attribute values [T. 13]. The average value can be also used for some attribute values to measure whether a domain is generated by a DGA or not. Those statistical values are also calculated over a list of domains from the Top 500 Alexa Ranking. These are considered as reference values for non-DGA domains regarding the relevant label attributes.

The domains with multiple levels are split into their labels for further analysis.
E.g. this domain: http://www.developers.google.com is split into:
com – Top-level domain (TLD)
google – Second-level domain
developers – Third-level domain
www – Fourth-level domain

All domains in the domain list resulting from a sample are compared level-wise, such that the labels of every domain are only compared with the same level.

To find proper indicators for DGA usage, we have done a level-wise comparison of statistical values calculated over several lexical properties of DGA domains and non-DGA domains (e.g. from Top 500 Alexa Ranking).

domain level.png
Different kinds of DGA patterns

Our experiments have proven that these statistical values over domain levels are very effective for DGA detection:

  • Average of the . . .
    • number of used hyphens
    • maximum number of contiguous consonants
  • Standard deviation of the . . .
    • string length of the label
    • consonant and vowel ratio
    • entropy
  • Redundancy of substrings or words in case of a wordlist DGA

With all these arguments, we can build a score with a specific threshold. If the score exceeds the threshold, the component will decide that the analyzed domain list was generated by a DGA. Since the arguments are based on statistical values, which lose their significance with smaller sets, it is also important to consider the case with too few domains. In this regard, the score is scaled down.

Separation of non-DGA Domains

Malware tries often to connect at first to a benign host (e.g. google.com) to check their connectivity to the internet. So, in case of DGA-based malware, the samples do not only send requests to DGA domains, but also to non-DGA domains. Hence the program for DGA classification needs to expect a domain list containing DGA domains and non-DGA domains. Before the program can classify a DGA, it needs to filter out the non-DGA domains. In this process, we assume that the majority of the domains in the domain list of the sample are DGA domains. Therefore, the non-DGA domains are considered as outliers. We used different outlier methods to identify non-DGA domains:

  • Outlying TLD-label
  • Outlying www-label
  • Find outlier with the method of Nalimov [A. 84] regarding following label attributes:
    • String length of the label
    • Digit and string length ratio
    • Hyphen and string length ratio
    • Consonant and vowel ratio
  • Outlying label count
  • Find outliers by too few occurring values regarding the following label attributes (the more a value occurs, the higher is the probability that it belongs to a DGA domain. We use the opposite case to find non-DGA domains here):
    • String length of the label
    • Consonant and vowel ratio
    • Digit and string length ratio
    • Entropy
    • Hyphen and string length ratio
    • Relative position of the first hyphen in the label

DGA Classification

After the separation process of DGA domains from non-DGA domains, we start with the classification of the DGA. The classifier analyzes the list of DGA domains and creates a, specific as possible, regex that matches all these DGA domains. If the separation is not completely successful, the program will continue with the classification based on DGA domains and non-DGA domains which could lead to a wrong description of the DGA. But not every failing separation process causes a wrong classification. In some cases, if non-DGA domains cannot be differentiated from DGA domains regarding any domain attributes, then the classification will still return the correct DGA description, since it covers only the relevant domain attributes. If the failing separation process causes a wrong DGA description, then the resulting wrong or imprecise DGA description could be interpreted still as a fingerprint calculated over the requested non-DGA domains and the DGA domains. That fingerprint is still useful to group malware of the same family, because it is very common that those requested non-DGA domains occur in other malware samples of the same family, too.

DGA’s do not generate necessarily always the same set of domains, because in most cases the seed of the DGA is changed (usually the date is used as seed).
In the following picture, you can see that the calculated DGA regexes are not matching because of the differentiating first letter, which seems to be seed-dependent in that case:

not equal dga
DGA with seed dependent first character

An important requirement to the automatically generated DGA description is that it needs to be independent of the seed. Since it is in our perspective not possible to determine which part of a DGA domain is seed dependent, we use an approach that tries to generalize the seed-dependent part of a domain.
For this task, we use three layers of regexes that are hierarchically arranged:

  1. Layer: very generalized pseudo-regex
  2. Layer: generalized regex
  3. Layer: specific regex

All those regexes can be interpreted as DGA descriptions (calculated with only one sample) of the same DGA with different precision.
Such hierarchy could look like this:

tinba dga
Tinba-DGA
simda dga
Simda-DGA

Evaluation

Out of 113.993 samples, the DGA classifier detects 782 DGA-based malware samples.
To determine the false positive rate, we have reviewed the results of the analysis system manually. Regarding the DGA detection we have found 38 false positives in our result set. Hence we have a false positive rate that is lower than 0.049% (with the assumption that the DGA-based malware samples queried a relative high number of different domains). A false negative evaluation is hard in this case, because the number of input sample is too high for manual evaluation. For an automatic false negative evaluation, the required ground truth of a large sample set is missing.

The following excerpt shows some specific DGA regexes from the DGA classification, which used 38.380 DGA-based malware samples as input:

Domain Fingerprint / Regex Matches Family name
[0-9A-Za-z]{8}\.kuaibu8\.cn 569 Razy
[a-hj-z]{3}y[a-z]{3}\.com 2082 simda
[a-z]{11}\.eu 829 simda
[a-z]{6,12}\.(com|info|net|org|dyndns\.org) 2047 Pykspa
[b-y]{12}\.(com|in|net|ru) 8508 tinba
[b-y]{12}\.com 6296 tinba
[b-y]{12}\.pw 7714 tinba
[b-y]{12}\.(biz|pw|space|us) 35 tinba
[b-y]{12}\.(cc|com|info|net) 17 tinba
[d-km-y]{12}\.(com|in|net|ru) 172 tinba
[d-y]{12}\.(com|in|net|ru) 110 tinba
[c-y]{12}\.(com|in|net|ru) 45 tinba
[df-km-su-x]{12}\.(com|in|net|ru) 110 tinba
[a-z]{8}\.info 216 tinba
v[12]{1}\.[a-uw-z]{7}\.ru 77 Kryptik
v1\.[a-uw-z]{7}\.ru 579
[b-np-z]{7,11}\.(com|net) 113
[0-9A-Za-z]{8}\.[aiktux]{3}i[abnu]{2}8\.cn 114
[a-y]{6,19}\.com 644 Ramnit
[a-z]{14,16}\.(biz|com|info|net|org) 173

It is conspicuous that the Tiny Banker Trojan (Tinba) has a very high occurrence with different specific regexes in the result set. After generalizing the most regexes of Tinba, as described in section 2.3, it will be possible to group all samples with only one regex. The missing family names are given by the fact that we could not detect automatically to which malware family the samples that used the DGA belongs.

Conclusion

The result shows that DGA detection and DGA classification can be very useful to detect new malware samples by their DGA. Hence it is also possible to find links between old and new malware samples of the same family via their classified DGA.
The DGA detection seems to be very reliable for samples that have queried many different domains. Our implemented concept for DGA classification seems to be in many cases successful. However, there are still cases where the calculated DGA descriptions are not correct, because the created patterns are sometimes overfitted to the given domain lists or rather non-DGA domains were considered in the calculation of the DGA descriptions, too. To confine this problem, we use a multi-layered regex generalization.
Even wrong DGA descriptions can be still considered as fingerprints calculated over the domain list of the sample. That fingerprint could be used to classify the DGA-based malware, so that it makes still a good contribution to automated malware analysis.

Literature

[A. 84] A. Zanker. Detection of outliers by means of Nalimov’s test – Chemical Engineering, 1984.

[H. 16] H. Zhang, M. Gharaibeh, S. Thanasoulas, C. Papadopoulos Colorado State University, Fort Collins, CO, USA. BotDigger: Detecting DGA Bots in a Single Network, 2016.

[R. 13] R. Sharifnya and M. Abadi – Tarbiat Modares University Tehran, Iran. A Novel Reputation System to Detect DGA-Based Botnets, 2013.

[T. 13] T. Frosch, M. Kührer, T. Holz – Horst Görtz Institute (HGI), Ruhr-University Bochum, Germany. Predentifier: Detecting Botnet C&C Domains From Passive DNS Data, 2013.

Zeus Panda: Down To The Roots

Some time ago, we analyzed Panda’s webinjects to get an insight in how they actually work and to understand their communication with the ATS servers (read it here: part 1, part 2).

In the last few weeks, we drilled down on the binary itself and had a closer look on this side of the Zeus.Panda malware. In the resulting whitepaper, we present a more in-depth analysis of the malware executable, detailing the malware’s actions on the victim’s PC beyond and in addition to infecting browsers to enable fraudulent banking transactions.

Find the whitepaper here (pdf).

 

Negative Result: Reading Kernel Memory From User Mode

I were going to write an introduction about how important negative results can be. I didn’t. I assume you can figure out for yourself why that is and if not you got all the more reason to read this blog post. If you think it’s trivial why my result is negative, you definitely need to read the blog post.

The memory subsystem

I think most researchers would immediately think that reading kernel memory from an unprivileged process cannot work because “page tables”. That is what the Instruction Set Architecture or ISA says. But in order to understand why that was unsatisfactory to me we need to dig a bit deeper and I’ll start at the beginning.

 
When a software running on a Core requires memory it starts a so called “load” command. The load command is then processed in multiple stages until the data is found and returned or an error occurred. The Figure below shows a simplified version of this sub system.

Memory hierachy.png

Software including operating system uses virtual addressing to start a load (which is what a memory read is called inside the CPU). The first stage of processing is the L1 cache. The L1 cache is split between a data and an instruction cache. The L1 Cache is a so called VIPT or Virtually Indexed, Physically Tagged cache. This means the data can be looked up by directly using the virtual address of the load request.  This along with central position in the core makes the L1 incredibly fast. If the requested data was not found in the L1 cache the load must be passed down the cache hierarchy. This is the point where the page tables come into play. The page tables are used to translate the virtual address into a physical address. This is essentially how paging is enabled on x64. It is during this translation that privileges are checked. Once we have a physical address, the CPU can query the L2 cache and L3 cache in turn. Both are PITP caches (Physically Indexed, Physically Tagged) thus requiring the translation in the page tables before the lookup can be done. If no data was in the caches then the CPU will ask the memory controller to fetch the data in main memory. The latency of a data load in the L1 cache is around 5 clock cycles whereas a load from main memory is typically around 200 clock cycles. With the security check in the page tables and the L1 located before those we see already at this point that the because “page table” argument is too simple. That said – the intel manuals software developer’s manuals [2] states that the security settings is copied along the data into the L1 cache.

 

Speculative execution

In this and the next section I’ll outline the argumentation behind my theory, but only outline. For those a bit more interested it might be worth looking at my talk at HackPra from January [3] where I describe the pipeline in slightly more detail, though in a different context. Intel CPU’s are super scalar, pipelined CPU’s and they use speculative execution and that plays a big role in why I thought it may be possible read kernel memory from an unprivileged user mode process.   A very simplified overview of the pipeline can be seen in figure below.

 

pipeline.png

The pipeline starts with the instruction decoder. The instruction decoder reads bytes from memory, parse the buffer and output instructions.  Then it decodes instructions into micro ops. Micro ops are the building blocks of instructions. It is useful to think of x86 CPU’s as a Complex Instruction Set Computer (CISC) with a Reduced Instruction Set Computer (RISC) backend. It’s not entirely true, but it’s a useful way to think about it. The micro ops (the reduced instructions)  are queued into the reordering buffer. Here micro ops is kept it until all their dependencies have been resolved. Each cycle any micro ops with all dependencies resolved are scheduled on available execution units in first in ,first out order. With multiple specialized execution units many micro ops are executing at once and importantly due to dependencies and bottlenecks in execution units micro ops need not execute in the same order as the entered the reorder buffer. When a micro op is finished executing the result and exception status are added to the entry in the reorder buffer thus resolving dependencies of other micro ops. When all Micro ops belonging to a given instruction has been executed and have made it to the head of the reorder buffer queue they enter the retirement processing. Because the micro ops are at the head of the reorder buffer we can be sure that retirement is in the same order in which the micro ops were added to the queue. If an exception was flagged in the reorder buffer for a micro op being retired an interrupt is raised on the instruction to which the micro op belonged. Thus the interrupt is always raised on the instruction that caused even if the micro op that caused the interrupt was executed long before the entire instruction was done. A raised interrupt causes a flush of the pipeline, so that any micro op still in the reorder buffer is discarded and the instruction decoder reset. If no exception was thrown, the result is committed to the registers. This is essentially an implementation of Tomasulo’s algorithm [1] and allows multiple instructions to execute at the same time while maximizing resource use.

 

Abusing speculative execution

Imagine the following instruction executed in usermode

mov rax,[somekernelmodeaddress]

It will cause an interrupt when retired, but it’s not clear what happens between when the instruction is finished executing and the actual retirement. We know that once retired any information it may or may not have read is gone as it’ll never be committed to the architectural registers.. However, maybe we have a chance of seeing what was read if Intel relies perfectly on Tomasulo’s algorithm. Imagine the mov instruction sets the results in the reorder buffer as well as the flag that retirement should cause an interrupt and discard any data fetched. If this is the case we can execute additional code speculatively. Imagine the following code:

mov rax, [Somekerneladdress]

mov rbx, [someusermodeaddress]

If there are no dependencies both will execute simultaneous (there are two execution units for loads) and while the second will never get it’s result committed to the registers because it’ll be discarded when the firsts mov instruction causes an interrupt to be thrown. However, the second instruction will also execute speculatively and it may change the microarchitectural state of the CPU in a way that we can detect it. In this particular case the second mov instruction will load the

someusermodeaddress

into the cache hierarchy and we will be able to observe faster access time after structured exception handling took care of the exception. To make sure that the someusermodeaddress is not already in the cache hierarchy I can use the clflush instruction before starting the execution. Now only a single step is required for us to leak information about the kernel memory:

Mov rax, [somekerneladdress]

And rax, 1

Mov rbx,[rax+Someusermodeaddress]

 

If this is executed last two instructions are executed speculatively the address loaded differs depending on the value loaded from somekerneladdress and thus the address loaded into the cache may cause different cache lines to be loaded. This cache activity we can we can observe through a flush+reload cache attack.

The first problem we’re facing here is we must make sure the second and third instruction runs before the first one retires. We have a race condition we must win. How do we engineer that? We fill the reorder buffer with dependent instructions which use a different execution unit than the ones we need. Then we add the code listed above. The dependency forces the CPU to execute one micro op at a time of the filling instructions. Using an execution unit different than the one we used in the leaking code make sure that the CPU can speculatively execute these while working on the fill pattern.

The straight forward fill pattern I use is 300 add reg64, imm. I choose add rax, 0x141. I chose this because we have two execution units that are able to execute this instruction (Integer alus) and since they must execute sequentially one of these units will always be available to my leakage code (provided that another hardware thread in the same core isn’t mixing things up).

Since my kernel read mov instruction and my leaking mov instruction must run sequentially and that the data fetched by the leaking instruction cannot be in the cache the total execution time would be around 400 CLK’s if not the kernel address isn’t cached. This is a pretty steep cost given that an add rax, 0x141 cost around 3 CLK’s. For this reason, I see to it that the kernel address I’m accessing is loaded into the cache hierarchy. I use two different methods to ensure that. First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1. Gruss et al [4] concluded that prefetch instructions may load the cache despite of not having access rights. Further they showed that it’s possible for the page table traverse to abort and that would surely mean that I’m not getting a result. But having the data already in L1 will avoid this traverse.
 

All said and done there are a few assumptions I made about Intels implementation of Tomasulo’s algorithm:

1)    Speculative execution continues despite interrupt flag

2)    I can win the race condition between speculative execution and retirement

3)    I can load caches during speculative execution

4)    Data is provided despite interrupt flag

 

Evaluation

I ran the following tests on my i3-5005u Broadwell CPU.

I found no way I could separately test assumption 1,2 and 3. So I wrote up the code I outlined above and instead of code above I used the following code:

Mov rdi, [rdi];  where rdi is somekernelmodeaddress

mov rdi, [rdi + rcx+0]; where rcx is the Someusermodeaddress

Then I timed accessing the Someusermodeaddress after an exception handler dealt with the resulting exception. I did 1000 runs and sorted out  the 10 slowest out. First I did a run with the 2nd line above present with the value at the kernel mode address 0. I then did a second run with the second line commented out. This allows me test if I have a side channel inside the speculative execution. The results is summarized in the below histogram (mean and variance respectively: mean= 71.22 std = 3.31; mean= 286.17 std =  53.22). So obviously I have a statistic significant  covert channel inside the speculative execution.

 
In a second step I made sure that the kernel address I was trying to read had the value 4096.  Now if I’m actually reading kernelmode data the second line will fetch a cache line in the next page. Thus I would expect a slow access to Someusermodeaddress. As the speculative fetch should access a cache line exactly one page further. I selected the offset 4096 to avoid effects due to hardware prefetchers I could’ve added the size of a cache line. Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed. I double checked by accessing the cache line I wanted to access and indeed that address was not loaded into the cache either. Consequently it seems likely that intel do process the illegal reading of kernel mode memory, but do not copy the result into the reorder buffer. So at this point my experiment is failed and thus the negative result. Being really curious I added an add rdi,4096 instruction after the first line in the test code and could verify that this code was indeed executed speculatively and the result was side channeled out.

 

Pandoras box

While I  did set out to read kernel mode without privileges and that produced a negative result, I do feel like I opened a Pandora’s box. The thing is there was two positive results in my tests. The first is that Intel’s implementation of Tomasulo’s algorithm is not side channel safe. Consequently we have access to results of speculative execution despite the results never being committed. Secondly, my results demonstrate that speculative execution does indeed continue despite violations of the isolation between kernel mode and user mode.

This is truly bad news for the security. First it gives microarchitecture side channel attacks additional leverage – we can deduct not only information from is actually executed but also from what is speculatively executed. It also seems likely that we can influence what is speculative executed and what is not through influencing caches like the BTB see Dmitry Evtyushkin and Dmitry Ponomarev [5] for instance.It thus add another possibility to increase the expressiveness of microarchitecture side channel attacks and thus potentially allow an attacker even more leverage through the CPU. This of cause makes writing constant time code even more complex and thus it is definitely bad news.

Also it draws into doubt mitigations that rely on retirement of instructions. I cannot say I know how far that stretches, but my immediate guess would be that vmexit’s is handled on instruction retirement. Further we see that speculative execution does not consistently abide by isolation mechanism, thus it’s a haunting question what we can actually do with speculative execution.

 

Literature

[1] Intel® 64 and IA-32 Architectures Software Developer Manuals. Intel. https://software.intel.com/en-us/articles/intel-sdm

 

[2] Tomasulo, Robert M. (Jan 1967). “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”. IBM Journal of Research and Development. IBM. 11 (1): 25–33. ISSN 0018-8646. doi:10.1147/rd.111.0025.

 

[3] Fogh, Anders. “Covert shotgun: Automatically finding covert channels in SMT”

https://www.youtube.com/watch?v=oVmPQCT5VkY.

 

[4] Gruss, Daniel, et al. “Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR.”

Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

 

[5] Evtyushkin D, Ponomarev D, Abu-Ghazaleh N. Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR. InProceedings of 49th International Symposium on Microarchitecture (MICRO) 2016.

Statistics and Infosec

 

Introduction

 

The work by John Ioannidis [1] among others started the “reproduction crisis”. The “reproduction crisis” is the phrase used to describe that in many branches of science researchers have been unable to reproduce classic results which form the basis for our understanding of the world. Since the basic principle of modern science is about standing on the shoulders of giants, it is incredibly important that sound methodology is used to insure that our foundation is solid. For now the reproduction crisis has hardly reached information security and I hope it will stay that way. But unfortunately the information security scene is full of issues that make it vulnerable to reproduction problems. Top among them is methodology and the lack of reproduction studies. A huge amount of papers is accepted as true, but never reproduced and many papers suffer from methodologic problems.

 

In this blog post I will argue that among information security papers, very basic statistics could improve the process and reproducibility. Statistics is important as soon as observed data are noisy. Noisy data are everywhere even in computer science. Users are non-deterministic, unobserved variables, manufacturing variance, temperature and other environmental influences all effect our measurements. This of course does not mean that we cannot say anything meaningful about computers – it just means we need to embrace statistics and so far the information security science has not been very successfull at that. This purpose of this blog post is to suggest a baby step in the right direction – using sample size, mean and standard deviation as a mean of improving the expose on noisy data. I intentionally keep the theory short – I encourage everybody to read a real statistics book. Also, with the intention to provoke, I picked two examples of papers where the statistics could have been done better.

 

Statistics, Sample Size, Mean, and Standard Deviation 101

During a lecture my first statistics professor said that he would automatically flunk any of us in the future if we reported on noisy data without mentioning 3 data points on them. These data points are the sample size (N), the mean, and the standard deviation. The first reason for bringing these three data points is that it is as close as one gets to a standard in science . It is accepted in other sciences ranging from physics over medicine to economics. Having a standard on reporting noisy data allows us to compare studies easily and gives us an easy way to reproduce the findings of other people. Unfortunately, this standard appears to not have arrived in full force in information security yet.

But it is not only about a standard. The thing is: using these three values is not just a historical accident. There is sound theory why we want these three values. Imagine a process generating our data – usually called the data generating process or d.g.p. If the d.g.p. produces observations where the noise is independent of other observations we say the d.g.p. produces independent, identically distributed observations or i.i.d. If observations are not i.i.d. we cannot treat any variation as noise and consequently we need further analysis before we can draw any conclusion based on the data. Hence, i.i.d. is often assumed or considered a good approximation. And very often there are good reasons to do so.

With i.i.d. data the arithmetic mean of our observations will almost surely converge towards the mean of the d.g.p as the number of observations increases. This is called the law of large numbers. If one rolls a dice a thousand times and gets an arithmetic mean of 3.0, it is an indication that the dice is not a fair dice. This explains why we wish to report the arithmetic mean. If the number of samples is sufficiently high, then the arithmetic mean is “close” to the real mean of the data generating process.

Also with i.i.d. data the central limit theorem almost always applies. The central limit theorem says that we can expect the mean to tend towards normally distributed as the sample size increase. The normal distribution is characterized by only two parameters, its mean and its standard deviation. It allows us to check hypothesis about the data. The most basic tests can even be made on the back of an envelope. This allows to put a probability on the dice used in the previous example not being a fair dice – that we are not just observing noise.

Why we should include the sample size should be pretty obvious by now. Both the law of large numbers and the central limit theorem hinge on convergence over the sample size. Thus if we have a small sample size, we are unable to rely on any conclusions. That should not surprise you. Imagine throwing a dice 3 times getting 1,2, and 4 – you would be a fool to insist that your data show that 6 rarely happens. If you have thrown the dice a million times and do not see any 6’s the dice almost surely is not fair and the aforementioned conclusion would be pertinent.

In most first (or second if applied with rigor) semester statistics books, you can find a discussion on estimating how many samples you need based on the estimated standard deviation from a pilot sample or prior knowledge on the d.g.p. That of course can be turned around to evaluate if a study had sufficient sample size – yet another reason why these three numbers are valuable.

In short three small numbers in a 14 page paper can provide deep insight into the soundness of your methodology as well as making your paper easily comparable and reproducible.If  you are in doubt, you need to bring these 3 numbers. One word of caution though: I think these numbers are necessary but they may not be sufficient to describe the data. Also you will have no quarrel with me if I can easily calculate the standard deviation from reported data . For example report variance instead of standard deviation or omitting the standard deviation when reporting on proportions (the Bernoulli distribution is described only by its first moment) is just fine.

 

Case study 1: A Software Approach to Defeating Side Channels in Last-Level Caches

I will start out with a pretty good paper that could have been better if standard deviation and mean had been mentioned: “A Software Approach to Defeating Side Channels in Last-Level Caches “ by Zhou et al. [2]. The authors develop CacheBar – a method to mitigate cache side channel attacks. For Flush+Reload attacks they do a copy-on-access implementation to avoid sharing memory which effectively kills Flush+Reload as an attack vector. I shall concentrate on Zhou et al.s [2] protection against Prime+Probe. For this purpose, Zhou et al [2] use an overcommitted page coloring scheme. Using overcommitted page coloring allows for a significantly more flexible memory allocation and thus less performance penalty than the classic page coloring scheme. However, an attacker will remain able to Prime a cache set to a certain extend. Consequently, the scheme does not defeat Prime+Probe, but it does add noise. The paper analyzes this noise by first grouping observations of defender demand on a cache set into “None, one, few, some, lots and most” categories. They subsequently train a Baysian classifier on 500.000 Prime+Probe trials and present the classification result in a custom figure in matrix form. I have no objections thus far. If the authors think that this method of describing the noise provides insights then that is fine with me. I am however missing the standard deviation and mean – they do offer the sample size. Why do I think the paper would have been better if they had printed these two values?

  • The categories are arbitrary: To compare the results to other papers with noise (say noisy timers) we would have to categorize in exactly the same fashion.
  • Having reported means we would immediately be able to tell if the noise is biased. Bias would show up in the figures as misprediction but an attacker could adjust for bias.
  • With standard deviation we could calculate on a back of an envelope how many observations an attacker would need to actually successfully exfiltrate the information she is after and thus evaluate if the mitigation will make attacks impractical in different scenarios. For example, you can get lots of observations of encryption, but you cannot ask a user for his password 10000 times.
  • Reproducing a study will in the presence of noisy data always have slight differences. When the black and white matrix figures are close enough to consider reproduced? For means and standard deviation that question has been resolved.

That all said, I do not worry about the conclusion of this paper. The sample size is sufficient for the results to hold up and the paper is generally well written. I think this paper is a significant contribution to our knowledge on defending against cache side channel attacks.

 

Case study 2: CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks

Caveat, Apology and More Information

Before I start with my analysis of the statistics in this paper, I should mention that I had given my off-the-cuff comments on this paper before and I generally stand by what I wrote on that occasion. You will find my comments published in Security Week[6]. I did get one thing wrong in my off-the-cuff remarks: I wrote Brasser et al [3] draws heavily on Pessl et al. [4]. Brasser et al. do not use the Dram-mapping function used by Pessl et al.[4], as I initially thought they did. Instead, they use an ad-hoc specified approximate mapping function. Also it is very important to mention that I am referring to an early version of the paper on archiv substance to my previous off-the-cuff remarks or what I write in this blog post. As my critique of the paper is relatively harsh, I emailed the authors ahead of posting the blog post. As of publishing the blog post, I have not yet received an answer.

Getting dirty with it

“CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks” by Brasser et al. [3] implements two mitigations for the rowhammer problem: B-Catt and G-Catt. G-Catt uses memory partitioning, so that the kernel is not co-located in the same banks as user mode memory which prevents code hammering the kernel space from user mode. I am not convinced that G-Catt cannot be circumvented but that is not the subject of this blog post. I will instead focus on B-Catt. Kim et al [5] suggested not using vulnerable memory addresses and B-Catt implements this idea in a boot loader, rather elegantly using int 15h, subfunction 0xe820 which provides the operating system a list of memory regions it should avoid using. If the computer is not using any vulnerable addresses, one cannot flip bits with row hammer and the system is safe. Obviously, not using some of the physical memory comes at a cost for the end user. Kim et al. [5]concluded that “However, the first/second approaches are ineffective when every row in the module is a victim row (Section 6.3)”. Contrary, Brasser et al.[3] concludes: “…we demonstrate that it is an efficient and practical solution that effectively prevents rowhammer attacks as a short-term solution”. So, the big question is who it gets right here. The key issue is how many pages contain bit flips. Kim et. Al[5] do not provide numbers on pages, but Brasser et al do. They evaluate 3 test systems to support their conclusion: “However, our evaluation (Section VI-A2) suggests that only a fraction of rows are vulnerable in practice.”

And this is where my gripe with the statistics in this paper is. Sometimes you can get away with a tiny sample size. That is when you have prior evidence that the sample variance is either irrelevant or negligible. Relevance of the variance here is given. The prior evidence of a small variance is not present. In fact, I will argue quite the contrary in later.

But first let me get rid of the terrible loose language of “practical and efficient”. My 15 year experience as a professional software developer tells me that software needs to work as advertised on at least 99% of the systems. That is in my opinion a conservative value. Further assume that 5% memory overhead is acceptable for B-Catt users. Notice here that I defined practical as a fraction of systems that must be running with an acceptable overhead. The alternative would have been defining practical as a maximal acceptable average overhead. I consider the fraction approach the most applicable approach in most real world scenarios – it certainly is easier to do back-of-envelope calculations with, which I will be doing for the remainder of this blog post.

Now we can put the sample size of 3 into perspective. Imagine that in reality 10% of systems have more pages with bit flips than are acceptable. With a sample  size of 3 that gives us a 73% (0,93) chance that our evaluation will be “suggesting” that we are indeed practical. Being wrong 73% of the time is not a suggestion of being right. We need more data than 3 observations to assume practicality. Obviously, you may argue that my 10% is pulled out of a hat – and it is indeed. It is a conservative number, but not unrealistic.

This leads us to the question of how much data do we need? The answer to that question depends on the variance in the data, something that we do not know. So, we can either make an a-priori guess based on domain knowledge or do a pilot sample to get an idea. My first semester statistics book [6] states: “We recommend taking at least 20 observations in a pilot sample”. Pp. 419. So, the author’s samples do not even qualify as a pilot sample.

So we are left with turning to domain knowledge. Kim et. al [5] in my opinion has the best available data. They sample 129 modules but do not mention the number of affected pages. They do however mention that 3 modules have more than 40% affected rows. On a two channel skylake system (to my knowledge the best case) we can have 4 rows per page and assume all bit flips occur so that the smallest number of pages are affected – then we end up with 3 modules with at least 10% overhead out of 129. Thus no lesss 2,3% of modules will have too much overhead, according to my definition, in Kim’s samples. This certainly constitutes a reason for skepticism on Brasser et al’s[3] claims.

Kim et al. [5] unfortunately do not calculate a mean or standard deviation – they do report their complete data, so that I could calculate it, but instead I will use a bit of brute force to get somewhere. We know that rowhammer is fairly bad because 6 of 129 samples have more than 1 million bit flips. Assuming that each cell is equally likely to flip that gives us a 14,8% probability that a given page is vulnerable for any of these 6 modules. This assumption has weak support in Kim et. al. [5] Consequently, at least 4.6% of the modules are not practical under my definition and my fairly conservative assumptions. With a finite base of modules (200) and under the assumption that modules are equally common in the wild and 129 observations, we can calculate a 95% confidence interval. This confidence interval tells us that between 2.44% – 6.76% of the assumed 200 modules in the population are vulnerable. Obviously, it is not realistic to assume only 200 modules in existence, but if there are more the confidence interval would be even larger. Thus, under these assumption we must dismiss that B-Catt is practical with 95% confidence. It looks like B-Catt is in trouble.

The assumptions above are of cause very restrictive and thus more research is required. What would this research look like? A pilot sample with at least 20 samples would be a good start. Alternatively, we could use my above assumptions to calculate a sample size. If we do this we end up with the number 80.

Obviously, everything here rides on my definition of practicality and efficiency. My gut feeling is that B-Catt won’t turn out to be practical if we had a reasonable sample. Nevertheless I would love to see data on the issue. While I think my assumptions are fair for a mainstream product, there might be special cases where B-Catt may shine. For a data center that can search for a ram product with few bit flips B-Catt may very well be a solution.

 

Literature

[1] Ioannidis, John PA. “Why most published research findings are false.” PLos med 2.8

(2005): e124.

[2] Zhou, Ziqiao, Michael K. Reiter, and Yinqian Zhang. “A software approach to

defeating side channels in last-level caches.” Proceedings of the 2016 ACM SIGSAC

Conference on Computer and Communications Security. ACM, 2016.

[3] Brasser, Ferdinand, et al. “CAn’t Touch This: Practical and Generic Software-only

Defenses Against Rowhammer Attacks.” arXiv preprint arXiv:1611.08396 (2016).

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM addressing for cross-cpu

attacks.” Proceedings of the 25th USENIX Security Symposium. 2016.

[5] Kim, Yoongu, et al. “Flipping bits in memory without accessing them: An

experimental study of DRAM disturbance errors.” ACM SIGARCH Computer

Architecture News. Vol. 42. No. 3. IEEE Press, 2014.

[6] Berry, Donald A., and Bernard William Lindgren. Statistics: Theory and methods.

Duxbury Resource Center,

[7] Kovacs, Eduard. “Researchers Propose Software Mitigations for Rowhammer

Attacks” http://www.securityweek.com/researchers-propose-software-mitigations-

rowhammer-attacks

New cache architecture on Intel I9 and Skylake server: An initial assessment

 

Intel has introduced the new I9 CPU which is seen as HEDT (High-End-DeskTop) product. The micro architecture is in many respects shared with the new Skylake server micro architecture.I f history is a guide technology introduced in this segment slowly trickles down to more budget friend desktops. From a Micro architecture point of view it seems that several things about these CPU’s will force changes on micro architectural attacks – especially in the memory subsystem. In this blog post I’ll do a short overview on some of relevant changes and the effects they’ll may have on micro architecture attacks. Since I don’t own or have access to an actual I9 processor or Skylake server processor this blog post is just conjecture.

 

Resume of the “old” cache hierachy

The major changes over earlier process from a micro architecture point of view is that the cache system has received a significant overhaul. Current Intel CPUs have a 3 level cache hierarchy – two very small L1 caches, one for data, one for instructions. The second level cache (L2 or Mid Latency Cache) is somewhat larger. L1 data, L1 code and L2  part of each core and private to the core. Finally, Intel CPU’s had a huge 3rd level cache (usually called L3 or  largest latency cache) shared between all cores. The 3rd level cache is subdivided into slices that are logically connected to a core. To effectively share this cache, Intel connected them on a ring bus called the Quick Path Interconnect. Further the 3rd level cache was an inclusive cache, which means that anything that is anything cached in L1 or L2 must also be cached in L3.

 

Changes

Some of the important changes that has been announced in the Intel Software Optimization Manuals [1]  are:

–    A focus on a high number of cores in the CPU (up to 18 in the HEDT models)

–    A reduced over all cache size per core (compared to similar older models)

–    A very significant increase in the size of the L2 (factor of 4)

–    Doubled the bandwidth of L2, while only slightly increasing the latency

–    Slightly more than offset by a reduction of the shared L3.

–    Reorganized  L3 cache to be a non-inclusive cache

–    Replaces the QPI with a mesh-style bus.

Why does these changes make sense?

Increasing the size of L2 on the cost of L3 makes sense as the L2 is much faster than L3 for applications – one can only assume that making the L3 helps reduce die size and cost. Also the increase in size of the L2 caches reduces the marginal utility of the L3. Finally  as the probability of cache set contention rises with the number of cores, it becomes advantageous to make a larger part of the total cache private. Cache contention in L3 is a problem Intel has battled before. With Haswell they introduced Cache Allocation Technology (CAT) to allow software to control cache usage of the L3 to deal with contention.

The number of cores is probably also the reason why Intel dropped the QPI ring buffer design. There is a penalty for having memory served from another core’s slices. On a ring bus this penalty is proportional to how far the cores are a part on the ring. With more cores, this penalty increases. Moving to a more flexible system seems sensible from this vantage point.

Having an inclusive L3 makes cache coherency easier and faster to manage. However, an inclusive cache comes with a cost as the same data are loaded in multiple caches. The relative loss of total cache storage space is exactly the ratio of the (L2 +L1) to L3 sizes. Previous this ratio has been around 1:10 (depending on actual CPU), but multiplying the L2 size by 4 and making the L3 a tiny bit smaller the ratio is now about 1:1.5. Thus the making the L3 cache non-inclusive is very essential to performance. At this point it’s important to notice that Intel uses the wording “non-inclusive”. This term is not well defined. The opposite of inclusive is exclusive meaning the content of L3 cannot be loaded in L1 and L2. I think Intel would have used the defined term exclusive if the cache really where exclusive. Thus,   it is probably fair to assume that non-inclusive means that data may or may not be cached in L1, L2 and L3 at the same time, but exactly how this is governed is important. Unfortunately there is no information available on this.

It’s worth noting that many of these changes has been tested and developed by Intel for the Knights landing micro architecture. Knights landing is a high throughput micro architecture, sold in relative small amounts. Thus it’s likely that many features developed for this CPU will end up being trickled down.  It’ll be interesting to see if Intel plans to trickle it down into laptop/small desktops or use different cache designs for different classes of CPUs.

 

Effects

Cache side channel attacks

This new cache layout is likely to have profound effect on cache side channel attacks. I think Flush+Reload will work even without the non-inclusive cache. The flush primitive is based on the CLFlush instruction which is part of the instruction set architecture (ISA). Intel has been very reluctant in the past to change the ISA and therefore my estimate is that flushing will work as always. I think the reload primitive will remain active – I find it likely that an uncached load will also load stuff into the shared L3. Also it’s likely that the QPI replacement bus can be used to fetch data from private L2 caches similar to AMD’s cross CPU transmission. This will give Flush+reload a nice flush+transfer flavor, but it’ll still work. For more on Invalidate (Flush)+Transfer see [2]. Since the L3 cache must be filled in someway we can fairly certain that at least one of these things are true, if not both. The latter being my guess.

The big change are likely to be related to evict and prime primitives. Assuming that cache contention between cores was a major reason for redesigning the cache, it’s unlikely that one can can load stuff into another core’s private hierarchy. Thus, the prime and evict primitives for  cross core attacks.  However,both are likely to work decently within a core (Hyper threading or scheduling on same core).

While the ISA behavior of CLFLush is almost certain to remain unchanged, the micro architecture below it will see significant changes. With QPI gone using the flush+flush attack by Gruss et al. [3] to find out how many hubs on the ring bus away you are from a particular slice almost certainly won’t work. This does not mean that you won’t find a side channel here in CLFlush, buses are typically bandwidth limited thus being an obvious source of congestion – without an inclusive L3 the bus bandwidth might even be easier to consume. Also the Flush+Flush attack as a Flush+reload replacement, is likely to produce a different timing behavior in the new micro architecture. My guess upfront is that a timing difference and thus a side channel remains.

Also affected by the non-inclusiveness of the L3 is row buffer side channel attacks such as those presented by Pessl. et al.[4]  Without an effective eviction cross core attacks may be severely stifled. The ability to reverse engineer the DRAM complex mapping function is likely to remain unchanged as it hinges not on eviction, but the CLFlush instruction’s ISA behavior.

With the CLFlush instruction and evict likely still working on local cores, row hammer will  remain effective. But in some scenarios, indirect effects of  micro architecture changes may break specific attacks such as that of  Sarani Bhattacharya, Debdeep Mukhopadhyay [5]. The reason is that such attacks rely on information leakage in the caches, that becomes more difficult to leverage as described above.

 

Future

While the changes to the cache makes sense with the significant number of cores in the affected systems, it seems unlikely that the changes will trickle down to notebooks and laptops. With only two cores the existing cache design seems sensible. Thus we are likely to see a two tier cache design moving on.

Conclusion

Having a non-inclusive L3 cache is significantly more secure from a side channel perspective than an inclusive in cross core scenarios. This opens up for defending these attacks, by isolating different security domains on different cores potentially dynamically. While flush+reload is likely to be unaffected, this attack is also the easiest to thwart in real life scenarios as avoiding shared memory cross security domains is an available and effective countermeasure. Lots of new research is required to gauge the security of these micro architecture changes.

 

Literature

[1] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. July 2017. https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[2] Irazoqui, Gorka, Thomas Eisenbarth, and Berk Sunar. “Cross processor cache attacks.” Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security. ACM, 2016.a

[3] Gruss, Daniel, et al. “Flush+ Flush: a fast and stealthy cache attack.” Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2016. 279-299.

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.” USENIX Security Symposium. 2016.

[5] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

Zeus Panda Webinjects: Don’t trust your eyes

In our last blog article Zeus Panda Webinjects: a case study, we described the functionality of current Zeus Panda webinject stages and gave some insight into the corresponding administration panel. As we only scratched the surface of the target specific second webinject attack stage (in the following we reference this as 2nd attack stage), we would like to share more details about this part.

Basically, the 2nd attack stage already includes the complete code needed for the attack. The different code branches are triggered by setting status variables, especially the branch variable already introduced in our previous article on that topic. Last time we also introduced the send() function, which is used to exfiltrate data. send() isn’t entirely unidirectional: the HTTP response of this request includes further code that is evaluated as JavaScript. Thereby the backend is able to set the different status variables to trigger the existing code branches of the 2nd attack stage. Let’s dive into the details of this communication protocol:

Communication protocol and status variables

sequence_zeus_panda
Figure 1: Communication protocol

Figure 1 illustrates the communication protocol between the 2nd attack stage and the backend server. We see the different steps of the communication, the branches triggered, and the website on which the step occurs. Before going into details, the concept behind the communication is the following:

  1. The current attack state is sent from the client to the backend server.
  2. The backend checks for the current attack state and sets the right response parameters to initiate the next attack stage.
  3. The backend response contains variables to notify the 2nd attack stage (client), which attack branch should be executed next.
  4. The 2nd attack stage evaluates the response variables and triggers the next branch.
  5. This procedure is repeated until the final state of the protocol is reached.

Time to branch

Let’s take a detailed look into the different branches now.

Step
Action
1 The SL branch is triggered at the beginning of the attack, when an infected victim accesses the login page of the targeted online banking, inserts the login credentials and clicks on the submit button. (NOTE: The low level Trojan functions need to trigger an the initial webinject (generic loader) on that website and therefore the URL of the online banking website has to be listed in the trojan config file). The submitted login credentials are intercepted, exfiltrated to the backend (see previous blog post), then the 2nd stage code calls the original login function of the banking or payment website. The backend now registers the new victim, identified by the botid. It returns an empty response to the webinject.
2 At this point, the victim has successfully logged in and has been redirected to the account balance overview page. This triggers the 2nd branch: CP. The CP branch is called multiple times during the attack and transmits general status information of the victim to the attacker. The response of the backend contains status flags to trigger the next step of the attack. At this point here, the backend signals to initiate the attack.
3 The attack signal triggers the 3rd step shown in Figure 1: The TL branch. This branch is used to collect details from all available accounts by using the grabber module. Furthermore, a flag is set to indicate a page reload after the response of the send function has been received. The collected data is then exfiltrated again. The botid is used to correlate transmitted data to existing victim entries in the backend and therefore works as unique identifier for the victim. The server response is empty, but the previously set reload flag now triggers the CP branch again.
4 The CP branch now sends the some information to the backend as  described in Step 2. As the backend has stored a different state for the botid already, the response is different now. It signals the 2nd attack stage that the grabber module has finished and the ats module should start now. This module is used to manipulate account details like the account balance or transaction details. Also some status flags are set to trigger the next branch.
5 The GD branch: This branch is used to collect and exfiltrate account details of the victim. As already described in step 3, the reload flag is used to trigger the CP branch again.
6 The CP branch again submits status information, and the backend now triggers the next step of the attack. Besides some status flags, details about the target account and some fake data is provided. The data is used by the CP branch to display a fake overlay with a message and/or images, to trick the victim into starting a transaction. To that end, the fake overlay is used like in a normal phishing attack. We could observe different kinds of messages, which could be categorized into different modi operandi. (see below).

If the victim fell for the scam, the previously provided data is used to pre-fill the transaction form. Naturally, this data contains a target account for the transaction. This account will be controlled by the attacker somehow, i.e., it most likely belongs to a money-mule.

Additionally, the response from the backend contains fake information to be displayed. Depending on the modus operandi, this information is used to display different transaction details to the victim, then the ones used for the transaction in background.

7 Now the victim is redirected to the overview page for a successful transaction. In combination with the current flag state, this page visit triggers the TL branch of the 2nd stage code. The TL branch is used to collect details from the transaction overview page and exfiltrates them to the backend. This indicates a successful transaction to the attacker. The backend response is empty. The webinject transits into the next state, without the need for further communication with the backend.
8 The last triggered branch is called CG. It creates a copy of the complete DOM of the successful transaction overview page and exfiltrates it to the backend. There is no indication that this data is displayed in the admin panel, thus we assume it is transmitted for debug purposes only.

Modi Operandi

In the following we detail two different exemplary modi operandi, which we could observe during our analysis. The real visible appearance is different, as the webinject makes heavy use of the style-sheets provided by the target website. This is a very straight-forward way to properly brand fraudulent content to match the corporate design of target banks or payment providers. We focus on the content shipped to banking customers.

Charity Fraud: SOS-Kinder

SOS_Kinder.png

The victim is asked to donate 1€ to an non-profit organization, in this case for SOS children. This mimics the well know internationally active “SOS-Kinderdorf” organization. The German text is well written and does not contain the obvious indications for phishing that we all love and know from the occasional phishing mail, like contorted grammar and a more than flowery vocabulary. No Google Translate in sight, here. To leverage this scam vector, the webinject makes use of the data provided by the backed in Step 6 as detailed above. Using an overlay, the victim is made believe he/she is transferring 1€, but under the hood the amount is change to a much higher value.

The attackers follow a very classic social engineering approach for our part of the world and appeal to the victims helpfulness: Who doesn’t want to help children in need by spending 1€? We refer to this kind of attack as charity fraud.

Refund Fraud: Finanzpolizei

finanzpolizei_edited

The overlay presents a message to the victim, indicating a transaction has been made to their account. As the victim sees a manipulated version of his account balance, he really believes the transaction has had happen. Furthermore the text indicates a preliminary investigation by the “Finanzpolizei” against the initiator of the transaction. If the victim is not transferring the money back, the text threatens with prosecution by law enforcement for participating in a money laundering scheme.

Finally, all the Google Translate and contextual cluelessness we came to love in the scams out there! Regrettably for the attacker, not all German-speaking countries are actually Germany. (We tried that once, partially, and it was a horrible idea.) An institution called “Finanzpolizei” does indeed exist — but not in Germany. The valid target audience for this scam is thus supposedly to be found in Austria, however, the scam is also actively used in Germany. The German text includes some mistakes and is not as well written as the first modus operandi we have shown above.

In the case at hand, the attackers try to make the victim follow through with a classic refund scam, by threatening legal consequences. As the story works without the need to manipulate the transferred amount under the hood, the fake data needed in the first described modus operandi is not used in this kind of attack. Nevertheless the attack is kind enough to prefill the transaction form with the correct details to ease the transaction for the victim.

Return of the victim

Now let’s assume the victim has been tricked into initiating a transaction by themselves to send their money to the attacker. What happens, if the victim takes a look into his online banking account some time later? As expected, the 2nd attack stage is also prepared for that case: The user is presented the “temporarily unavailable” notification (see Figures 1 and 2 from our previous post) and the login function of the target website is disabled. As long as the status variables are set to the finale state of the described communication protocol, the victim is thus unable to access their account again as long as the backend server is reachable. Even when disabling this blocking functionality, account information like transaction details and total balance are still manipulated. As this manipulations use the originally provided style cheets (CSS) from the target institute, a victim has no way to visibility distinguish between a fake entry and an original one.

Conclusion

Nowadays almost all financial institutes make use of two-factor authentication to protect their users from fraud. The modi operandi used by current banking trojan attacks successfully circumvent this by using social engineering techniques. The victim is tricked into initiating the transaction willingly and happily provides all information needed to confirm the transaction. This is achieved by visible modifications of the website that are indistinguishable from the original website content. The success rate of these attacks is still quite high.

By using a multi-layered attack, it’s also cumbersome for analysts to get an complete insight into the technical details. As soon as the backend server is not available anymore, only the 1st stage of a webinject is accessible on an infected machine. Without the backend server, most of the attack code is not available and therefore some pieces of the puzzle are missing.

These kind of multi-layered attacks have become more and more complex and sophisticated. However, beyond the visual appearance, the code of the original website is modified heavily to make this attacks work and these modifications necessarily leave a footprint. In our fraud detection solutions, we provide our customers with instant visibility into these modification symptoms so they can fare better at protecting their customers’ assets.

Authors: Manuel Körber-Bilgard and Karsten Tellmann