June 2016 – cyber.wtf

Row hammer the short summary

Introduction

This is the first updated version of my original “Row hammer the short summary” blog post. As I had predicted the summer was going to be interesting in terms of row hammer and it certainly appears I was right about that. With so much going on I found it worthwhile updating this blog post to be in line with the latest developments and to fix up a few minor details.

Short version of how dram works.

Current DRAM comes in modules called DIMM’s. If you buy a modern memory module for your PC, you’re buying a DIMM. If you look at the DIMM most DIMMs will have chips on both sides. Each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are in the physical individual chips you can see. Inside a bank you’d find a two dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify bit is equal to 1 and no charge when bit is equal to 0 (on some chips the encoding is reversed). Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read or write from/to DRAM an entire row is first read into a so called so called row buffer. This is because for reading automatically discharges the capacitor and since writes rarely rewrite the entire row. Reading a row into the row buffer is called activating the row. An active row is thus cached in the row buffer. If a row is already active, it is not reactivated on requests. Also to prevent the capacitors loose charge overtime they are refreshed regularly (typically every 64 ms) by activating the rows.

Row hammer introduction

This section is based on [1] Kim et Al. where not otherwise noted.

When a row is activated a small effect is caused on the neighboring row due to so called cross talk effects. The effect can be caused by electromagnetic interference, so called conductive bridges where there is minor electric conductivity in dram modules where it shouldn’t be. And finally, so called hot-carrier-injection may play a role where an electron reaches sufficient kinetic energy where it leaks from the system or even permanently damage parts of the circuitry. The net effect is a loss of charge in the DRAM cell which if large enough will cause a bit to flip.

Consequently, it’s possible to cause bits to flip in DRAM by reading or writing repeatedly and systematically from/to two rows in DRAM to (active the rows) bit flips can be introduced in rows up to 9 rows away from these two “aggressor rows”. The 9 rows are called victim rows. The most errors happen in the row immediately next to an aggressor row. Picking the aggressor rows so they bracket a victim row is called double sided row hammering and is far more efficient that normal row hammering. Using two adjacent rows to hammer surrounding rows is called amplified single sided hammering and can be useful in exploitation scenarios. If the victim rows are refreshed before enough cross talk can be generated no bit flips is incurred. As a rule of thumb the higher the frequency of row activation the higher the probability of flipping bits.

It has been shown that bits can be flipped in less than 9 milliseconds and typically requires around 128k row activations. [3] Seaborn & Dullien has reported bit flips with as little as 98k row activations.

The problem occurs primarily with RAM produced after 2010. In a sample of 129 RAM modules from 3 manufacturers over 80% where vulnerable. With all modules produced after 2012 being vulnerable. [4] Lanteigne showed that DDR4 ram is vulnerable too with 8 out of 12 sampled DRAMs was subject to bit flips. Further this paper showed that certain patterns in the DRAM rows where more likely to cause bit flips.

[21] Lateigne concludes that AMD and Intel CPU’s are both capable of row hammering, but that the most bit flips are encountered when the methodlogy is adapted to the underlying memory controller in the attacked system.

Physical addressing and finding banks and row

Obviously to cause row hammering one needs two addresses belonging to rows in the same bank. [2] showed that repeatedly choosing two random addresses in a large buffer would in a practical amount of time belong to the same bank and thus be useful for hammering in software.

An optimal solution requires that the attacker has knowledge of physical addresses. Even with a physical address an attacker would need to know how they map to dimm, banks and rows to optimally row hammer. [5] Seaborn used row hammer itself to derive the complex function that determines physical address to dram location for a sandy bridge computer. [6] Pessl et al. showed how to use “row buffer side channel attacks” to derive the complex addressing function generically and provided maps for many modern intel CPU’s.

To obtain physical addresses the /proc/$PID/pagemap could provide this information. However, /proc/$PID/pagemap, which is not available in all operating systems and no longer affords unprivileged access in most operating systems that do support it. This problem for an attacker remains to be definitively solved.

Row hammering with Software

To cause row hammer from software you need to activate memory rows, that is cause reads or writes to physical memory. However modern processors are equipped with caches, so that they don’t incur serious speed penalties when memory is read or written. Thus to cause row hammering bit flips it’s required to bypass the caches.

[1] did this using the clflush instruction that removes a particular address from the cache causing the next read of the address to go directly to memory. This approach has two down sides. First, since clflush is a rare instruction, validator sandboxes (such as NaCL of google chrome) can ban this instruction and thus defeat this attack. Second Just-in-time compilers and existing code on computers generally do not use this opcode disabling attack scenarios where jit compilers are used (such as javascript) or for the future using existing code in data-only attacks.

[7] Aweke posted on a forum he was able to flip bits without clflush – he did not say how, but it was likely using the same method as [8] which systematically accessed memory in a pattern that causes the processor to evict the address of the attacker row from the cache causing the next read to go to physical memory. Unfortunately, how to evict caches is CPU dependent and undocumented and despite [8] Gruss, Maurice & Mangard mapping out how to optimally evict on most modern CPU’s it’s not the most trivial process. Typically, this requires knowledge of the physical address discussed above as well as a complex mapping function for cache sets. It is however possible to approximate this either through using large pages or through timing side channels. Also it is slower and thus less efficient than the clflush version above. Since this vector does not require special instructions, it’s applicable to native code (including sandboxed code), java script and potentially other JIT compiled languages.

[9] Qiao & Seaborn found out that the movnti instruction is capable of by passing the caches on it’s own. Further this instruction is commonly used – including as memcpy/memset in common libraries and thus difficult to ban in validator sandboxes and lowers the burden for future row hammer as a code reuse attack. It remains to be seen if JIT compiled languages can make use of it.

Finally, [10] Fogh showed that the policies that maintains the coherency of multiple caches on the CPU can be used to cause row activations and speculated it would be fast enough for row hammering. Since the coherency policies are subject to much less change than cache eviction policies and does not require any special instructions this method may solve problems with the other methods should it be fast enough. This remain to be researched.

Exploiting row hammer

[2] showed that row hammer could be used to break out of the NaCL chrome sandbox. The NaCL sandbox protects itself against by verifying all code paths before execution and block the use of undesired activities. To avoid new unintended code paths to be executed the sandbox enforces a 32 bit aligned address for relative jumps and adding a base address. In code it looks like this:

and rax, ~31a

add rax, r15  //(base address of sandbox)

jmp rax

Bit flips in these instructions often cause other registers to be used and thus bypass the sandbox enforcing limits on relative jumps. By spraying code like the above then row hammer, check for usable bit flips and finally use one of these relative jumps to execute a not validated code path and thus exit the sandbox. Not validated code path can be entered through code reuse style gadgets.

The second and more serious privilege escalation attack demonstrated by [2] was from ring 3 user privileges to ring 0. Since adjacent physical addresses has a tendency to be used at the same time, CPU vendors map adjacent physical addresses to different parts of RAM as this offers the possibility of memory request being handled by DRAM in parallel. This has the effect that banks are often shared between different software across trust boundaries. This allows an attacker to flip bits in page table entries (PTE). PTE’s is central in the security on x86 and controls access writes to memory as well as mapping between virtual and physical addresses. By repeatedly memory mapping a file many new PTE’s are sprayed and the attack has a good chance of hitting by random using row hammer. The attacker hopes that a bit flips so that a PTE with write privileges maps to a new physical address where another PTE is located. By strategically modifying this second PTE the attacker has read & write access to the entire memory.

[18] Bhattacharya & Mukhopadhyay. Used to Row Hammer to extract a private RSA 1024 bit key. Their attack used a combination of PAGEMAP, a cache side channel attack (prime+probe) and a row buffer side channel attack to find the approximate location in physical memory of the private key. Once located row hammer is used to introduce a fault in the private key, and fault analysis is then used to derive the real private key. This make the attack somewhat unique in that it’s the only attack so far that does not rely on any kind of spraying.

[19] Bosman et. Al. Breaks the Microsoft Edge’s javascript sandbox . First they use two novel dedublication attacks to gain knowledge about the address space layout. This allows them to create a valid looking, but counterfeit java object in a double array. They then find bit flips by allocating a large array and using single sided row hammering. The method used is similar to [8] but they also notice and exploit the fact that pages 128k appart are likely to be cache set congruent. Once they know where the bit flips are they can place a valid object at this address, that is so crafted that the bit flip will change it to a reference to the counterfeit object. Once this is set the row hammering is repeated and they now have a reference for the counterfeit object that can be used by compiled javascript. Further the object can be edited through the double array in which it was created and this allows arbitrary memory read and write.

[20] Xiao et al. The content of this paper is unknown to me, yet the title suggests that cross-VM and a new kind of privilege escalation is possible with row hammer.

It is likely that row hammer can be exploited in other ways too.

Row hammer mitigation

Most hardware mitigations suggest focuses on refreshing victim rows thus leaving less time for row hammer to do its work. Unfortunately, during the refresh ram is unable to respond to requests from the CPU and thus it comes at a performance penalty.

The simplest suggestion is increase the refresh rate for all ram. Much hardware support this already for high-temperatures. Usually the refresh rate is doubled, however to perfectly rule out row one would need to increase the refresh rate more than 7 fold [1]. Which in term is a steep performance penalty and a power consumption issue.

TTR [17] is a method that keeps track of used rows and cause targeted refresh of neighbors to minimize the penalty. The method needs to be supported in both CPU as well as RAM modules. MAC also known as maximum activation count keeps tracks of how many times a given row was activated. pTTR does this only statistically and thus cheaper to build into hardware. PARA [1] is another suggested hardware mitigation to statistically refresh victim rows. ARMOR [16] a solution that keep tracks of row activation in the memory interface.

It has been suggested that ECC ram can be used as a mitigation. Unfortunately, ECC ram will not to detect or correct bit flips in all instances where there are multiple bit flips in a row. Thus this leaves room for an attack to be successful even with ECC ram. Also ECC ram may cause the attacked system to reset, turning row hammer into a Denial of Service attack. [4] Suggests this problem persists in real life experiments. Keeping track of ECC errors may however serve as an indication that a system was under attack and could be used to trigger other counter measures.

Nishat Herath and I suggested using the LLC miss performance counter to detect row hammering here [11] Fogh & Nishat. LLC Misses are rare in real usage, but abundant in row hammering scenarios. [12] Gruss et al. ,[13] Payer refined the method respectively with correcting for generally activity in the memory subsystem. The methods are likely to present false positives in some cases and [11] and [13] therefore suggested only slowing down offenders to prevent bit flips. [14] Aweke et al. presented a method that first detected row hammering as above, then verified using PEBS performance monitoring, which has the advantage of delivering an address related to a cache miss and thus grants the ability to selectively read neighbor rows and thus doing targeted row refresh in a software implementation. [15] Fogh speculated that this method could be effectively bypassed by an attacker.

Litterature

[1] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361–372, June 2014.

[2] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015.

[3] Mark Seaborn and Thomas Dullien. “Exploiting the DRAM rowhammer bug to gain kernel privileges”. https://www.blackhat.com/docs/us-15/materials/us-15-Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf

[4] Mark Lanteigne. “How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware

“. Third I/O. http://www.thirdio.com/rowhammer.pdf

[5] Mark Seaborn.” How physical addresses map to rows and banks in DRAM”;

[6] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, Stefan Mangard: „Reverse Engineering Intel DRAM Addressing and Exploitation“

[7] Zelalem Birhanu Aweke, “Rowhammer without CLFLUSH,” https://groups.google.com/forum/#!topic/rowhammer-discuss/ojgTgLr4q M, May 2015, retrieved on July 16, 2015.

[8] Daniel Gruss, Clémentine Maurice, Stefan Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”

[9] Rui Qiao, Mark Seaborn: “A New Approach for Rowhammer Attacks”.http://seclab.cs.sunysb.edu/seclab/pubs/host16.pdf

[10] Anders Fogh: “Row hammer, java script and MESI”http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html

[11] Anders Fogh, Nishat Herath. “These Are Not Your Grand Daddys CPU Performance Counters”. Black Hat 2015. http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html

[12] Daniel Gruss, Clémentine Maurice, Klaus Wagner, Stefan Mangard: “Flush+Flush: A Fast and Stealthy Cache Attack”

[13] Mathias Payer: “HexPADS: a platform to detect “stealth” attacks“. https://nebelwelt.net/publications/files/16ESSoS.pdf

[14] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin:”ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”

[15] Anders Fogh: “Anvil & Next generation row hammer attacks”. http://dreamsofastone.blogspot.de/2016/03/anvil-next-generation-row-hammer-attacks.html

[16] http://apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer/armor.html

[17] http://www.jedec.org/standards-documents/results/jesd209-4

[18] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

[19] Erik Bosman, Kaveh Razavi, Herbert Bos, Cristiano Giuffrida “Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector”

[20] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Mircea-Radu Teodorescu:”One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation”. To be published

[21] Mark Lateigne: “A Tale of Two Hammers. A Brief Rowhammer Analysis of AMD vs. Intel”. http://www.thirdio.com/rowhammera1.pdf

Cache side channel attacks: CPU Design as a security problem

End of May I had the opportunity to present my research on cache side channel attacks at the “Hack In The Box” conference. After my presentation with Nishat Herath last year at black hat I published my private comments to that slide deck and that was well received. I had decided to do that again for “Hack In The Box”, unfortunately it took me a little longer to translate my comments into something human readable. But here they are. Since the comments relate directly to a specific slide in the slide deck you’ll probably want to have the slide deck open when reading this blog post. You can find them here: https://conference.hitb.org/hitbsecconf2016ams/materials/D2T1%20-%20Anders%20Fogh%20-%20Cache%20Side%20Channel%20Attacks.pdf

Cache side channel attacks: CPU Design as a security problem

The title of the presentation took quite a while to figure out because I wanted a title that fairly and accurately described the content of the presentation, but one that was also slightly catchy. I like what I came up with.

Here I told how I got into cache side channel attacks as an introduction. I was working on a talk about row hammer detection, when Daniel Gruss tweeted he’d been able to row hammer in Java script. Consequently, I had to figure out how he did it, so that I could speak with confidence. Everything pointed towards the cache side channel attack literature, so I dug in and 4 hours later produced my first cache side channel attack. This side channel attack triggered my row hammer detection and that let me into a year worth of playing with CPU caches.

Agenda

When I start doing slides, I always think about what I want to come across to the listener well knowing that most people will have forgotten most two hours later.

1) I wanted to point out that safe software running on unsafe hardware is unsafe – something I think is too often forgotten. Further I wanted to make the point that we are not talking about a “bug” in the sense that an intel engineer made a mistake. Rather the cache system generally works as designed – though this design has unintended consequences: cache side channel attacks. Specifically:

Attacker can manipulate the cache: Shared L3 (between cores & trust zones) + Inclusive cache hierarchy
The cache state can be queried and give back a lot of valuable information: Timing attacks (on access & clflush), Shared L3, N-Way-Set associative cache, complex addressing function for L3 that application access to sets.

2) Here I just wanted to point out that this isn’t just an academic issue. In many ways cache side channel attacks, can be a real world alternative to VM breakout and privilege escalation attacks – or even a method to support either of those.

3) I wanted to make the point that we can do something about these critters – even when the attacker adopts.

Introduction

The most important things I wanted to get across was CSCs are an attack methodology and not an attack. Further that the cache design is not a requirement to be an x86, that is the cache could be designed differently and be less useful for cache side channel attacks without causing software incompatibility.
The bandwidth of the side channel is essentially the bandwidth of the cache and thus a lot of valuable data can be leaked. The size of the caches makes the retention time of the information high and thus allows us to actually read out the data in most cases (say compare to D’Atoine’s attack on instruction reordering where the information is lost immediately if not queried). Finally, software cannot really avoid using the cache and thus leak information.
Here I wanted to point out that intel is unlikely to provide a fix. They won’t replace old processors, micro code update is unlikely and Intel have brought out CPU’s for a while without fixing. Leading us towards software fixes for a hardware problem. This is a situation which is always suboptimal, but short to medium term the only option. It was my intention to show that while suboptimal entirely within the realm of the possible.

Scope

Litterature: ARM: [11], TLB[9], RowBuffer[10]

L3 is not small

Point here was to give a sense of scale on the importance of the L3 cache in the CPU. Also it’s kind of neat you can actually see slices of the L3 cache in a die shot.

Why is this interesting

The point of this slide was to illustrate the flexibility and power of CSC’s. Important to notice here: CSC’s are always attack on software implementation – not algorithms. An RSA implementation might leak it’s private key, another implementation may not.

Literature: Covert channels [13], RSA[4], AES[6], ECDSA[5],Keyboard[7],Mouse[8], KASRL[9]

How the data cache works on Intel + Important cache features

Here I wanted to introduce the design features that allows the attacker to manipulate the cache hierarchy. The second slide is just a summary slide.

How memory is stored in L3 + N-Way Set Associative L3

Here I wanted to introduce how the cache stores data. This is further back ground for manipulation, but also serves as a back ground for getting information from the cache. The “intel’s problem” is a so called fully associative cache on bytes. Essentially cache line concept limits accuracy of the attacks and the N-Way set associative cache allows an attacker to work on much smaller units making attacks practical. Further it allows an attacker to deduct information from cache-set congruence, something that is supported by the complex addressing function for the cache. I avoided the complex addressing function deliberately to avoid complexity and just mentioned that “the colors are not clumped together, but spread out”. Cache side channel attacks has historically only approximated this complex addressing function by doing a simple timing attack – loading N+1 addresses from the same set will cause one not to be cached and thus accessing it will be slow. The complex addressing function for cache sets has been reverse engineered resonantly for most intel CPU’s.

Example code to attack

The example code I use to explain the attacks is inspired by a function in the GDK library which is significantly more complex than this. I wanted to underline that code as well as data can be attacked and that branches matter. I wanted to speak about the big 3 attacks in the following slides by highlight some differences in a concrete example. The GDK library was first attacked by [7].

Common for all CSC

Here the goal was to introduce timing attack and all the cache manipulation primitives of the big 3 caches. Evict is just accessing enough cache lines congruent to a cache set to cause all previously stored information to be evict. Prime is a variation on Evict – where you carefully access cache lines congruent to a cache set in a way that the attacker knows what is in the cache set. The flush primitive is simply the clflush instruction, that removes a specific cache line from the cache hierarchy.

Big 3 Cache side channel Attacks

I wanted to comment that the 3 attacks and the variants of them make up most CSC’s and that they are all fairly similar in structure. Also I wanted to point back to the primitives described in the previous slide.

Literature: Evict+Time[1], Prime+Probe[2], Flush+Reload[3]

“The 4 slides on the attacks”

In these slides I wanted to highlight some advantages and disadvantages for each attack. The Evict+Time does not give any temporal resolution – which I call “post mortem analysis”, but you don’t need to worry about synchronizing with victim. Synchronizing can be done by any means including a separated CSC, a function call or any other information leak. Or even spying constantly. Though the accuracy is cache congruence it should probably be noted that prime and probe is disturbed by any cache set congruent memory operation whereas Evict+Time is only disturbed by those who evict exactly those cache lines used by the function. However, calling a function tends to bring more non-relevant code into play and thus noise. Noise is a more complex topic that the impression the overview slide gives.

Shared memory

I heard the „nobody shares memory“comment one too many times and wanted to raise a flag that it isn’t that uncommon. Finally, I wanted to warn against shared memory – particularly as it’s introduced through deduplication as that’s the most common vector in VM environments. Flush+Reload is just an example of the problems with dedublication, but one can can pile on with D’Antoine’s instruction reordering attack, dedublication attacks, more speculatively row hammer can become a practical cross VM vector with deduplication, row buffer side channel attacks etc. etc.

Noise

I wanted to point out that CSC’s are noisy. Usually the noise is due to contention with irrelevant code running on multicore CPU’s or contention in other subsystem than the cache. Also the hardware prefetcher can destroy things for an attacker. Of these effects only the effect of the hw prefetcher cannot be eliminated by repeating the attack – though obviously not all attacks lend themselves to be repeated (you cannot ask a user for his password 10k times). Sometimes you get good results in the first attempt. I had an Evict+Time attack that required more than 10k attempts to give satisfying results.

Performance counters

My „agenda“on my blackhat talk last year was to communicate that performance Counters can be an important tool for security related weirdness in the CPU. Part of this slide is an attempt to repeat that message. The important part was to introduce performance counters, performance counter interrupt and setting up the counter as they form an important part of my work on detecting cache side channel attacks.

Flush + Reload code

Here the Clflush is used as a “manipulate cache to a known state primitive”.

The Mov instruction (which could be replaced by most memory access instructions) is the reload phase. The RDTSC(P) does the actual timing of the reload phase and the mfence instructions prevents that the CPU reorder the instructions to do the reload phase outside the rdtsc(p) bracket.

My comments read that I should explain that cache side channel attacks like flush+reload is in a race condition with the process being spied upon.–Say if we’re attacking keyboard input we’ll be less visible if we wait a few milliseconds per iteration because nobody types that fast whereas for crypto we almost always need much higher temporal resolution and usually wouldn’t wait at all.

Detecting flush+reload

My original suggestion was to see how long a given count of cache misses would take. If too fast we had a cache side channel attack. [12] and [13] improved on that. All works fairly well.

Row hammer was flush + reload

Just noting here that if we remove the information acquisition and do it for two addresses at once we have the original row hammer code. It’s then easy to see that row hammer was a flush+reload attack. The “was” word was carefully chosen. Others has shown that the movnti instruction is a vector for row hammer too, and that vector is not a row hammer related attack. To round off my introduction I hope I mentioned that rowhammer.js was a flush+reload variation that I (and others) usually call Evict+Reload using the eviction primitive I discussed in a previous slide.

Flush + Flush

The back story here is I’d figured that clflush would leak information about the cache state and when approached by Daniel Gruss and Clementiné Maurice about detecting a cache attack that causes less cache misses I immediately knew, what they were talking about. Instead of competing to publish I did not do more work in this direction. I did get to review their wonderful paper though.

Flush+flush works with replacing the mov instruction in flush+reload with a clflush but is otherwise identical. The clflush instruction is probably faster when the cache line parameter isn’t in the cache because it’s able to shortcut actually flushing the cache.

Flush+flush has an advantage beyond stealthiness: clflush is faster than mov on uncached memory. Also it leaves the cache in a known state which means the first line of code can be skipped when iterating the attack. This attack is probably the fastest CSC. Also the clflush instruction introduces less problems with the hardware prefetchers. Literature: [13]

Why is flush+flush Stealth

Clflush does not cause cache misses! However, the victim still incurs cache misses due to flush and reload. This usually isn’t sufficient for the flush+reload detection I outline in previous slides to get decent detection rates without incurring significant amount of false positives.

Detecting flush+flush

The first point is an assumption that we’re not talking about a cross VM attack. My opinion is that cross VM flush+flush attacks should be foiled by not using dedublication. It’s also an assumption that I need. In a cross VM attack the attacker could use ring 0 to attack and thus bypass the CR4.TSD access violation. However, it is worth noting that even in this scenario it would make flush+flush more difficult.

The other 3 points are just the technology I use to catch flush+flush with.

Stage 1

This is actually a revamped version of my “Detecting flush+flush” blog post. After posting that I had some private conversations where my method got criticized for being too costly performance wise. So I first tried to find a better performance counter. There was an event that actually mentioned CLFLUSH in the offcore counters, unfortunately it was only available on a small subset of microarchitectures that I deemed too small to be worthwhile. I actually tried to see if intel just changed documentation and it appears they really revamped that event for other purposes. Then I played around with the CBO events in the cache system, though I could get better than cache-ref it was at the cost of being gameable from an attacker view point. So instead I decided to make a two stage approach – first detect the bracket and then detect clflush instruction. I had two reasons for this approach. The first is to deal with the performance impact of the second stage which can be quite severe. The second is I have a feeling that we’ll see other instruction-timing attacks in the months to come and this two stage model maps well to defending this kind of problem. The reason why this two stage works better is that the RDTSC instruction itself is rare, but in pairs spaced close enough for an attacker to not drown in noise is so rare that I’ve not seen a single benign application causing this in my testing.

Problems?

Using CR4.TSD to make rdtsc privileged affects performance two fold. First it causes a very significant slowdown on the RDTSC instruction with emulating it in an exception handler. However, the RDTSC is rarely used – in particularly rarely used in a fashion where it makes up a significant part of the total execution time and thus I think the performance penalty is acceptable. Much worse is that the RDTSC instruction becomes less accurate which might cause issues. Short of profiling code, I’ve yet to come up with a benign case where such accuracy would be required. I may be wrong though.

The detection obviously hinges strongly around the RDTSC being used as a timer. I’ve yet to come up with a good alternative to rdtsc, but that doesn’t mean it doesn’t exists.

Omissions

The Gory details left out is mostly eviction policy and complex addressing function related stuff. Such as finding eviction sets and priming the cache. There are many suggestions for mitigation – not sharing memory being one that’s very good but incomplete as it doesn’t work for evict+time and prime+probe. There are different ways of adding noise or making the timer less accurate – all in my opinion fundamentally flawed as noise or less timer accuracy can be defeated by law of large numbers. The Cache Allocation Technology can be used to protect against CSC’s – here I’m not a fan because it requires software to be reauthored to be protected and that it has limited coverage. Finally, it’s worth mentioning writing branch-free-cache-set-aware code which actually works, but is difficult and we’re unlikely to be able to demand that all affected software does so.

BlackHoodie and what came after

The demand for information security specialists will experience a growth by more than 50% through 2018, thus our industry is in massive need of talented engineers. On the other hand, the field is rather devoid of female engineers, who seem to frequently scare away from the professions in the security sector.

To encourage female engineers to step up to the challenge of reverse engineering, in 2015 Marion Marschalek decided to organize a free reverse engineering workshop for women, dedicatedly inviting females only. The motivation behind this move was to give female engineers the prospect of a comfortable learning environment; to make them feel entitled to take part, rather, than scare them away. Reverse engineering is considered one of the more complex fields of computer science, good learning material is not always freely available, a steep learning curve in the beginning demotivates a lot of students to get their heads around the materia. Thus the idea to host an event which would support one of infosec’s minorities, the ladies.

The workshop took place in September at University of Applied Sciences St. Pölten, Lower Austria. It was hosted on a weekend, in order to ensure a maximum number of participants could take part.

A total of 15 participants fought their way to rural Austria, which is impressive considering the overall scarcity of female engineers at common security events. It is evident that hosting such an event as a female-only workshop encouraged more female participants to join than there would have been at a general training course. The attendees came traveling from several different countries, many of them on their own expenses.

Prior to the workshop the participants had to complete 4 preparational exercises, in order to familiarize themselves with general concepts of malware reverse engineering. The topic itself is not easy to comprehend, more over, learning to reverse engineer binaries on a single weekend is surely not feasible. The preparational tasks were meant to instruct the participants on how to set up their own malware analysis environment, also the basic concepts of static and dynamic analysis were to be internalized. The tasks included several exercises to complete, but also a list of reading material to cover topics such as malware anti-analysis and runtime packers. Finally the participants had to perform first reverse engineering tasks by steering a debugger through a set of minimalistic binaries.

The workshop itself started off with high goals. During the designated weekend the attendees were confronted with a sample of real world malware, protected by a custom runtime packer and a number of anti-analysis measures. The binary itself was a variant of Win32.Upatre and rather compact, generally suitable for beginners, yet showing all the traits that make everyday malware. Still, a single weekend is a rather short timeframe, thus the attendees were required to put in a lot of energy; and concentration, guts and endurance. After a short recap of prior exercises the class started out on analysing and bypassing protection measures of the malware. These included self modifying code, a breakpoint detection trick and execution of code within a window handler function. After the protection layer the malware entered a decompression phase, followed by the import table reconstruction commonly performed by runtime packed binaries.

The final payload of the malware, a function to download further executable content from a remote server, was left as a homework due to timing constraints. Nevertheless, the attendees showed a lot of interest during the training and were an extraordinarily eager class.

Following the principle that easy content just won’t stick, the training weekend was intentionally not designed to be quickly digested. The attendees got a crashcourse on malware reversing, from where they were free to go on on their own to explore further challenges. The primary intention of the workshop was of course to familiarize the participants with binary analysis, furthermore though they were supposed to take away the credo “yes, I can” when looking at complex tasks in the future.

Indeed, now half a year after the workshop the former BlackHoodie attendees have shown marvellous success stories. Two of them have taken on their first reverse engineering positions with Quarkslab in Paris. One did her first malware research talk at Botconf last year, presenting on botnet analysis, and is going for the next speaking engagement soon; one spoke at RootedCon this year about iOS malware attacking non-jailbroken devices. Two ladies decided to pick up RE as topic for their thesis, one focusing on analyzing threat actor TTPs, one on analyzing the NDIS stack relying on memory images. Finally, an eager participant collected her first two CVEs this year by exploiting BMC Logic’s BladeLogic Server Automation product, presenting the findings at Troopers conference. Needless to say, among the participants are seasoned engineers, who excel in cryptography, software development, incident response, and security management every day.

We have high hopes last year’s BlackHoodie attendees keep up the good work and we are looking forward to the next edition, coming up in fall 2016. The next event will be hosted by Marion Marschalek and Katja Hahn in Bochum at the G DATA Campus. It will again be free of charge and hopefully encourage even more female engineers to wreck their brains over binary analysis.