Covert Shotgun

Automatically finding SMT covert channels

Introduction

In my last blog post I found two covert channels in my Broadwell CPU. This blog post will again be about covert channels. For those unfamiliar a covert channel is a side channel where the attacker has an implant in the victim context and uses his channel to “smuggle information” in and out of the victim context across existing security boundaries.

In this blog post I’ll explore how we can automate finding SMT covert channels. SMT is intel speak for hyper threading.

Before I proceed I should note that one of the two covert channels I found in my last blog passed, the one based on the RdSeed instruction, appears also to have been found by others. You can read about it in D. Evtyushkin & D. Ponomarev [1]. They will be presenting their work on this channel at CCS. Unlike myself they develop the channel fully and discuss mitigations. So if you find this channel interesting their paper is well worth a read.

 

SMT and side channels

For a long time, I’ve wanted to look for side channels in SMT. SMT is the technical name for hyper threading. Back in Pentium 4 Intel figured that most stuff that goes on in a CPU core is used relatively seldom, so they had the idea that you could run two hardware threads in the same core if you duplicate only the most important subsystems of the core. The remaining subsystems would be shared between the two hardware threads. In the event that the two threads simultaneously want to use a shared subsystem, one would just have to wait. In short you gain significant amount of performance, for a little bit of infrastructure. The problem is if you read what I just wrote sideways, it got “side channel” stamped all over it. In fact, there has been a number of example of side channels related to SMT found in the past. For instance, Acıiçmez and J.-P. Seifert [2], D’Antoine  [3] and my last blog post had neat little SMT side channel in the pause instruction – just to mention a few.

 

Covert Shotgun

The pause side channel I found by reading the intel optimization manual. This is the kind of work one usually do to find these critters. Sort of like a sniper looking for targets. But I figured it might be possible to automate the process – more like shooting a shotgun and hope you hit something. Hence my tool for doing so Is called covert shotgun.

The basic idea is if you have 3 instructions. One that you measure the latency of and two which you use to send a signal. The instruction where you measure the latency I call the receiver instruction. Of the two signal instructions one must use a subsystem that is not used by the receiver instruction, whereas the others should not. The receiver’s instructions latency will tend to be smaller when subsystems aren’t shared and we can encode this smaller latency as a 0 bit. To automated the process shotgun picks 3 instructions from a list of interesting instructions, measures latency on receiver instruction while running the two signal instructions respectively.

 

To implement this, I use the keystone assembler engine to generate receiver and signal code. The signal code is optimized to maximize pressure on any subsystem used by signal instruction:

A:

<signal_instruction>

<signal_instruction>

<signal_instruction>

<signal_instruction>

<signal_instruction>

Jmp A:

I repeat the instruction 5 times to minimize the effects of the jmp instruction. The receiver instructions latency is measured like this:

Cpuid

Mfence

Rdtscp

Mov rdi, rax

<receiver instruction>

rdtscp

sub rax, rdi

ret

With a list of instructions, I can pick a receiver instruction and two signal instructions and test the receiver with both signals. If there is a significant difference, I have a manifestation of a covert channel. Covert shotgun uses plain brute force on the list which makes it really slow and stupid  (N2 tests) – but for small lists works well enough.

As for what makes out a significant difference my initial idea was to do a classic P-test on 1.000.000 time samples to determine if there was statistical evidence to conclude there was a channel.  This turned up side channels that probably really does exists, but would be thoroughly impractical due to noise. The modified check is an average timing difference of at least 1 clock cycle or 1 standard deviation whichever is larger. Where I estimate the standard deviation the square root of the sum of the sample variance of the two. This insures should insure that the channel is practical. I acknowledge that the standard deviation estimator isn’t “blue”, but given the ad hoc nature of “practicality measure” it is not an issue.

Now the smart reader would’ve noticed I used the wording “manifestation of a covert channel”. I generally tend to only count something as a separate side channel if the system where the contention is caused is unique. With this method method we are likely to find several manifestations of the same side channel. However, it’s worth noting that different manifestations might have different properties and consequently any manifestation could turn out to be useful. For example, flush+reload and flush+flush would under this definition be the same side channel. But Flush+Flush is faster and is more stealthy than flush+reload. See my previous blog post on cache side channel attacks [4] and Gruss et al [5] on flush+flush.

Also worth noting is that cover shotgun will only find side channels of the instruction timing type, and only when the instruction parameters are inconsequential. Finding a side channel like flush+reload or even flush+flush is not possible in this fashion.

Strictly speaking cover shotgun can find covert channels cross core. The reality of it is that cross core covert channels of the instruction timing type is rare. I’ve only found the RdSeed channel so far.

The hardware

harddiagram

A priori I expect a significant amount of covert channels. To understand my a priori expectation let’s spend a second reviewing some hardware design. Below is a very simplified “map” of the instruction execution logic in my Broadwell intel cpu’ Specifically, I’ve tried to simplify by remove the memory subsystems from the diagram as covert shotgun currently is not usable to analyze this subsystem. Further I’ve sanitized the registers away for simplicity as they are not important for covert shotgun to the extent that registers are duplicated for the two threads.

The first step in execution is parsing instructions into myOp. There is a cache for myOp instructions which could probably be used for cache side channels – if I recall correctly I saw a paper on it. Covert shotgun currently only uses a handful of instructions at once and thus is unlikely to be able to find side channels in this cache. The cache however is particularly interesting as it would give information on instructions in the cache.

Once parsed the myOps are reordered to maximize parallel execution.

The reordered myOPs then enter a queue for the ports. Intel speak for this queue is unified reservation station or sometimes just reservation station. Intel architecture connect individual execution units through these ports.

Each port is capable of handling only one myOp per cycle meaning that it’s fairly easy to run into congestion in the ports.  Each of the 8 ports is connected to up to 7 individual execution units which each executes particular specialized tasks. Some execution units are duplicated and available on multiple ports, some execution units are available on only one port – this very well reflects that the CPU was designed for hyper threading. This would lead us to expect to find a high number of manifestations where the receiver instruction and 1-bit sender instruction are identical – with reordering and multiple execution units making sure that it occurs less than 100% of the time. For 0-bit instructions we’d expect a speedup when running parallel with itself. The main point of congestion is probably through the congestion in the individual execution engines – especially those who are not duplicated.

Until Haswell intel had 6 ports and Haswell added an additional 2 ports, going to a total of 8. This allows further parallelization of execution. I think Sky lake added even more ports and duplicated an integer execution unit for better parallelization’s of integer and vector instructions, but I’m not sure. More parallelization typically means less covert channels. These changes should serve as a reminder that the covert channels you find on say a Broadwell, need not exist on a Sandy Bridge and vice versa even when the instructions exist on both. The microarchitecture is subject to change and these changes has significant effects on covert channels and other micro architectural attacks.

Results

Due to the N2 nature of “covert shotgun” I ran a small list of 12 instructions where some were picked because I suspected covert channels in relation to them, and some were chosen because I found the particularly likely to be boring. The list of instructions

Instruction My reasoning
RdSeed rax Pretty slow, thus like a “0” signal instruction
Pause The most likely  “0” signal instruction
Nop Does it get anymore benign?
Xor eax,eax I used it in my last blog
Lea rax,[4*rax+edi+40960];lea rdx,[8*eax+rdi+409623] I suspected I might get into trouble with the Address generation unit, by using SIB bytes, two instructions and a constant bigger than 2048.
RdRand rax Sounded interesting
Add rax,1
Bts rax,1
Bt rax,1 Wanted to check too near identical instructions against each other
Cpuid An interesting instruction on any day
Movq xmm1, xmm2 Streaming instructions must be tested
Vtestps ymm1, ymm2 More streaming instructions…this time 256bit wide

 

Running these 12 instructions in brute force meant testing for 936 covert channels. In these I found 521 practical covert channel manifestations. That’s a whooping 55%. Now I expected there to a be a lot of manifestations in there, but this magnitude surprised me. Personally I don’t think the detailed results are that interesting. The thing is that I think SMT in and of it’s own should be considered the side channel here as opposed to considering congestion in subsystem below the SMT as the core reason. However, I will give you a summary of a few interesting finds below.

covershotgunmatrix

RdSeed turned out to be an extraordinary good “0” signal bit. All instructions run faster in parallel with this instruction than other instructions. No less than 46 of the covert channel manifestation was with RdSeed. With the RdSeed taking around 300 Clk’s to finish, it is likely that this instruction to a very large extend does not interfere with the execution of other instructions, leaving every shared resource mostly free for maximum speed. The pause instruction has the same effect, but not quite as strong and not quite as surprising as RdSeed (Part of the reason for the pause instruction is to yield for the other hw thread). Most likely it’s because the pause instruction has significantly lower latency on my computer and thus more overhead than the RdSeed. If I recall correctly the pause instruction pauses significantly longer on skylake. Interestingly enough the RdSeed instruction also have a significant speedup effect on itself. In my estimation it’s a faster speed up than can be explained with in the pipeline and the most likely reason for the speed up is due to the “seeding unit” being busy and the instruction thus terminating early.

Most of the 1-bit signal instruction instructions does seem to have a high latency when paired with itself than with the average instruction in my test. With the limited number of instructions and the serious selection bias involved I advise against any conclusion.

Unfortunately, I messed up the “lea” test in that I had dependency between the two lea’s and consequently I need to rerun the test. Rerunning the test with lea’s without dependencies did not appear to change the results significantly.

Future research

It would be interesting to run covert shotgun (Preferably made a bit smarter) on much larger lists of instructions to get a much better overview of the situation. Another interesting project would be identifying the subsystem which are being congested by specific instructions. A good starting point would be Agner Fog’s instruction tables[6] and the associated software. This software leverages performance counters to identify port usage and lots of other juicy instruction information.

Further it would be interesting to investigate to what extend these covert channels extend to spying. Especially I found the covariance in instruction timing between different instructions really interesting – there certainly are patterns in there. They suggest that you have congestion on multiple subsystems and perhaps – just perhaps – there is enough data in that to finger print code across SMT. That would be especially interesting say in a SGX scenarios or would lead directly to spying in branched victim code.

A word on microarchitecture timers

Most side channel attacks are timing attacks and thus subject to mitigation through modified timers either by intention or in cases where a hypervisor for other reasons decided to get a VMExit on Rdtsc instructions. For this reason, I’ve been searching for a fast timer that isn’t rdtsc and so far not found one. A natural candidate would be a fast loop synchronized through either overt or covert access. Overt access through memory access was in my experiments too slow to useful. Probably the reason is that a store operation is required that triggers a cache coherency action. So I wanted to see if the RdSeed channel could be used. Unfortunately, the answer is a clear no with RdSeed taking upwards of 200 CLK’s and micro architectural attacks often require accuracy below 70 CLK’s. That said – I have this other idea which I hope I’ll get around to soon

Conclusion

Despite my limited time with covert shotgun it’s has shown great potential. Also I think this simple experiment efficiently shows the dangers of SMT if it wasn’t clear to everybody already. In my opinion it’s justified to talk about SMT being the core cause of all these side channels (and other SMT channels as well) and consequently it’s not very useful to talk about “a pause side chance” as a separate side channel, but as a manifestation of the SMT side channel. With the number of manifestations and different available subsystem the only reasonable mitigation left against these covert channels seems to turn off SMT entirely in threat models that includes attackers having an implant in your VM. Personally I think the root cause is that the attacker is in your VM in the first place. If these methods can be used for spying it is very likely that some can be implemented in java script and thus be used for remote attacks.

 

[1] Dmitry Evtyushkin, & Dmitry Ponomarev , “Covert Channels through Random Number Generator: Mechanisms, Capacity Estimation and Mitigations”hhttp://ttp://www.cs.binghamton.edu/~dima/ccs16.pdf

[2] O. Acıiçmez and J.-P. Seifert., “ Cheap Hardware Parallelism Implies Cheap Security. Fault Diagnosis and Tolerance in Cryptography” (FDTC’07), pages 80-91, IEEE Computer

Society, 2007.

[3] Sophia M. D’Antoine, “Exploiting processor side channels to enable cross VM Malcious Execution”. http://www.sophia.re/SC/thesis.pdf

[4] Anders Fogh, “Cache side channel attacks”,http://dreamsofastone.blogspot.de/2015/09/cache-side-channel-attacks.html

 

[5] Daniel Gruss, Clémentine Maurice & Klaus Wagner(2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf

[6] Fog, Agner, “4. Instruction tables “ http://www.agner.org/optimize/instruction_tables.pdf

 

Two covert channels

Introduction

Too much WinWord. Too much Tex. Too many meetings. Too little CPU. It was time for a short pause from the grind and dig into some tetravalent metalloid. My current project was too big a mouthful to get into before going to Black Hat, so I dug up a pet project to play around with. And then it happened – I needed some info from the intel documentation and before you know it had what I believe is a new side channel in my head. Then a second one. Then a third. Two of these side channels looked like they would make for good arguments for a point I’ve been meaning to make. The third is more complex and it might be fairly useful. Consequently, I decided to write this small blog post on the two first side channels and make my point. Both of these side channels are technically interesting, however only the second is likely to be of any practical importance. I have not spent too much time or effort researching these channels as they are just a side project of mine. Beware: I’ve not checked the literature for these two particular channels, so I might be repeating other people’s research. I do not recall having read anything though.

Not all side channels are equal

Not all side channels are created equal. Some are inherently more powerful than others. However, there is no good way to measure how good a side channel is – it simply often comes down to the subjective question of what do an attacker need the channel for.

The simplest use for a side channel is the so called covert channel. Most microarchitecture side channels can be used to communicate between two trust domains when an attacker has access to both. An attacker can use covert channels for data exfiltration. The most important two reasons are to stay hidden or because regular channels are not open. An example of a covert channel would be an implant on a virtual machine in the cloud that communicates through, say a cache side channel, with another VM on the same physical computer. In this scenario the attacker doesn’t leave any TCP/IP logs and can communicate with his C2 server despite the defender turning off his network adapter.

While most microarchitecture side channels can be used as a covert channel, some side channels provides insights into normal system behavior and thus are powerful enough to be used for spying even when the attacker does not have access to both attacker and victim’s trust domain.

In this blog post we’ll be looking at two side channels which falls into this category of side channels, which are mostly irrelevant for spying but useful for covert channels. Since these channels are probably limited in this way I have refrained from researching them very deep. The point of this blog isn’t the two new side channels, but a different one. However both side channels are  from a technical view point interesting. The second side channel might even be practical.

 

Time for a pause – side channel with the pause instruction

Being old one forgets the name of the instructions from time to time and I needed a pause instruction and was searching for “wait”. Anyhow I picked up the optimization guide because I know the right name would be in there. And so it was, but then I read the description again and found the pause instruction has a profound impact on hyper threads because it does not significantly block resources of the other hw thread on the computer. But that has probably been utilized by somebody already. But more interesting if two hyper threads both use the pause instruction simultaneously power can be saved.  One has to wonder if that makes difference in the execution time of the pause instruction. And it does. But the story is longer. The latency of the pause instruction is heavily dependent on power management settings and the current activity of the computer. So the channel depends on those settings. Further Intel has been tinkering with the pause latency, so you’re likely to see different results on different computers. My results are from an I3-5005U, I did not test on other computers. My power settings where “Balanced”. Notice we are talking about a laptop here, which might be important too. Below a plot of latency measurements of the pause instruction with a core-co-located thread doing a loop of pause instructions and with a identically co-located thread doing a normal integer operation (xor eax,eax – did test with other instructions with similar results). The plot is for a loop of 40.000 measurements.

pause_latency

Usually we’d expect a speed up when a collocated hw-thread would pause, because the thread effectively yields all resources to the other hw-thread in the core and in the meassurements there is a bit of overhead that could be sped up. However we are seeing the exact opposite. I can only speculate about the reason, but it seems likely that with both co-located hw-threads in pause mode, the entire core goes into pause mode.That is the c-state of the core is increased and this comes at a small latency cost. Whatever the cause is, we have a measurable side channel. There is not much to spy on with this channel we could possibly spot when the kernel is doing a spinlock for some reason, but really juicy information does not seem to be a likely catch in the side channel. Since the channel depends heavily on power settings and on computer activity the channel might not be very useful in practice. Further the core-colocation requirement obviously detracts from value of the side channel.  Also it’s a timing side channel using rdtsc(p) instruction which gives leverage for detection as well as mitigation through timing inaccuracy. In fact this attack would be difficult to get to run under VirtualBox (Ubuntu host) as this virtual machine is setup to cause a vmexit on rdtsc which makes the measurement of such small time differences difficult (Though not impossible). Even worse for the value of this side channel is that pause itself can be made to cause vmexit and thus be used for direct detection/mitigation of any pause based side channel attack. So while we have a side channel here, it’s not likely to be a very practical one.

 Sowing a seed too many – side channel with rdseed

“Under heavy load, with multiple cores executing RDSEED in parallel, it is possible for the demand of random numbers by software processes/threads to exceed the rate at which the random number generator hardware can supply them. This will lead to the RDSEED instruction returning no data transitorily.”

 

That sounds like a very obvious side channel indeed. The RDSEED instruction has been added with the Broadwell micro architecture and uses heat entropy in the CPU to generate cryptographically strong random number. The problem with this instruction seems that for enough entropy to build up a bit of time is needed between calls to RDSEED. Thus intel designed the instruction to return an “error” using the carry flag if insufficient time has passed since the last call of RDSEED. So the basic idea is that an attacker could create a covert channel using this instruction. To send a 1 bit the sender implant loops an rdseed instruction and mean while the receiver runs a loop spaced with plenty of time between rdseed. The information is extracted in the recievesr’s end from a count of failed rdseed instructions. My simple test setup was an infinite sender loop which either called the rdseed instruction or not depending on the bit I wanted to send. My receiver looped 1000 times around did an rdseed instruction followed by a  10 ms Sleep() call.  0 bits caused zero failures in the receiver loop and typically around 800 failures in the 1bit scenario.  I tested only on a I3-5005U Broadwell laptop, but with the sender and receiver thread pinned on same core as well as on different cores. Results where near identical.

 

This particular channel is interesting in so many ways, despite its limited use. It is a really rare thing that side channels in the CPU are not timing attacks based around the rdtsc(p) instruction. Infact I know of only one other attack: The instruction reordering attack of Sophia D’Antoine which if I understand it correctly is limited to scenarios that are not really reflective of the real world. This attack however works cross core without other limiting factors – however it’s limited to micro architectures that support the instruction which is only the newest intel CPU’s. Further it does not appear to cause any interrupt that would allow instrumentations and does not appear to be wired to performance counters that would allow detection thus making low impact software mitigation difficult. Finally, theoretically the channel could be used to spy on benign usage of the rdseed instruction, but probably wouldn’t gain much information.

The real point

The point of this blog isn’t that I found two new side channels. The point here is an entirely different one. The point is that I as a byproduct found at least two side channels in the processor of which at least one does not seem like it’s fixable in software. It might be in microcode, but I don’t think that’s relevant either.  The point I wish to make is that with the amount of side channels already known and me finding two while not looking for them, it seems likely that there’ll be a great many side channels waiting to be discovered. Not to mention the likelihood that Intel will with time add new ones (recall the second side channel arrived with the Broadwell microarchitecture). Even if we knew all the side channels we’d be hard pressed to come up with software detection/mitigation and it seems unlikely that Intel will fix these side channels as they’ve not done so in the past. Consequently we should probably accept the fact that covert channels in cloud computing is a fact of life and consider them an unavoidable risk. This of cause does not mean we shouldn’t continue to find and document the channels, since our only defense against them is if malware analysts are able to spot one from a malicious implant.

 

 

BlackHoodie #2 – We roll again :)

Last year I held a free reverse engineering workshop for women, mainly in the not entirely un-selfish interest to see more of them around in the whole security field. More about the motivations, the why’s and obstacles and how it turned out you can read up here, here and here. Looking back, I’m super happy with this little project and, leaning out the window a bit further, call it a big success.

That said, I gleefully announce BlackHoodie #2, the next women-only reversing workshop, to take place in Bochum, Germany the weekend of 19th + 20th of November 2016. This edition will be held in cooperation with Katja Hahn, a splendid binary analyst herself, and Priya Chalakkal, an up-and-coming hacker of all things; and comply to the same principles as last year. It will be free of charge, no strings attached, and aim to help femgineers entering a field thats not easily accessible.

Moreover, in a wonderful initiative community members announced their support of this year’s edition by covering the travel expenses of a BlackHoodie attendee. The US startup Iperlane and Thomas Dullien aka. Halvar Flake will cover the trip for a lady, who decides to come join the workshop. The lucky attendee will be randomly selected from the group of registered participants.

May there be oh so many participants🙂🙂 So here we go again..

Why women only?

Because a girl-to-girl conversation is so much more fruitful than a full classroom with only one or two women hiding in the corners. I’ve done so many things in my life where I was the *only* girl among X other participants, and I promise I’ve been hiding in the corners more than once.

For the gents it might not be that obvious, but it is not easy for young females who haven’t yet found their place in life to walk into a class room, a university lecture, an office or a conference room full of men. Who, generally speaking, very often very well seem to know their place.

I’ve had girls in my classes before, hiding and holding back although I am so certain they would have been capable to be so much better than what their final results showed. So yeah this will be women only, for every female should feel welcomed and encouraged to do her best and get the most out of it.

Why more women in low-level technical jobs in general?

  • It’s difficult. Mastering something difficult makes you happy. I want all of you to be happy.
  • It pays well. While money makes you also happy, what’s more important, it gives you courage and independence.
  • It keeps you busy. Lots of open job positions globally, even better, believe it or not it is addictive and you might even find yourself a new hobby.

Hardfacts

  • Its gonna be Katja, and Priya, and me, and a binary, and you, and plenty of debuggers
  • Online preparation assignments, 4 of them, over the course of two months prior to the workshop
  • Workshop 19./20. of November at G DATA Academy, Bochum Germany
  • No fees, no strings attached, all you have to do is get there
  • Please register with your name or nickname and a short note about your background at blackhoodie at 0x1338 .at

Prerequisites

  • Being female
  • Computer science background in a sense you understand programming logic, how a processor works and how an operating system works
  • A Notebook capable of running at least one virtual machine
  • A virtual machine, preferred WinXP 32-bit
  • Guts🙂 (It is going to be a lot to learn in a very short time)

REGISTRATION:

Please register with your name or nickname and a short note about your background at blackhoodie at 0x1338 .at. About two weeks before the event you will be asked for a final confirmation of your participation.

Row hammer the short summary

 

Introduction

This is the first updated version of my original “Row hammer the short summary” blog post. As I had predicted the summer was going to be interesting in terms of row hammer and it certainly appears I was right about that. With so much going on I found it worthwhile updating this blog post to be in line with the latest developments and to fix up a few minor details.

 

Short version of how dram works.

Current DRAM comes in modules called DIMM’s. If you buy a modern memory module for your PC, you’re buying a DIMM. If you look at the DIMM most DIMMs will have chips on both sides. Each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are in the physical individual chips you can see. Inside a bank you’d find a two dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row.  Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify bit is equal to 1 and no charge when bit is equal to 0 (on some chips the encoding is reversed). Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read or write from/to DRAM an entire row is first read into a so called so called row buffer. This is because for reading automatically discharges the capacitor and since writes rarely rewrite the entire row. Reading a row into the row buffer is called activating the row. An active row is thus cached in the row buffer. If a row is already active, it is not reactivated on requests. Also to prevent the capacitors loose charge overtime they are refreshed regularly (typically every 64 ms) by activating the rows.

 

Row hammer introduction

This section is based on [1] Kim et Al. where not otherwise noted.

When a row is activated a small effect is caused on the neighboring row due to so called cross talk effects. The effect can be caused by electromagnetic interference, so called conductive bridges where there is minor electric conductivity in dram modules where it shouldn’t be. And finally, so called hot-carrier-injection may play a role where an electron reaches sufficient kinetic energy where it leaks from the system or even permanently damage parts of the circuitry.  The net effect is a loss of charge in the DRAM cell which if large enough will cause a bit to flip.

Consequently, it’s possible to cause bits to flip in DRAM by reading or writing repeatedly and systematically from/to two rows in DRAM to (active the rows) bit flips can be introduced in rows up to 9 rows away from these two “aggressor rows”. The 9 rows are called victim rows. The most errors happen in the row immediately next to an aggressor row. Picking the aggressor rows so they bracket a victim row is called double sided row hammering and is far more efficient that normal row hammering. Using two adjacent rows to hammer surrounding rows is called amplified single sided hammering and can be useful in exploitation scenarios. If the victim rows are refreshed before enough cross talk can be generated no bit flips is incurred. As a rule of thumb the higher the frequency of row activation the higher the probability of flipping bits.

It has been shown that bits can be flipped in less than 9 milliseconds and typically requires around 128k row activations. [3] Seaborn & Dullien has reported bit flips with as little as 98k row activations.

The problem occurs primarily with RAM produced after 2010. In a sample of 129 RAM modules from 3 manufacturers over 80% where vulnerable. With all modules produced after 2012 being vulnerable.  [4] Lanteigne showed that DDR4 ram is vulnerable too with 8 out of 12 sampled DRAMs was subject to bit flips. Further this paper showed that certain patterns in the DRAM rows where more likely to cause bit flips.

[21] Lateigne concludes that AMD and Intel CPU’s are both capable of row hammering, but that the most bit flips are encountered when the methodlogy is adapted to the underlying memory controller in the attacked system.

 

Physical addressing and finding banks and row

Obviously to cause row hammering one needs two addresses belonging to rows in the same bank. [2] showed that repeatedly choosing two random addresses in a large buffer would in a practical amount of time belong to the same bank and thus be useful for hammering in software.

An optimal solution requires that the attacker has knowledge of physical addresses. Even with a physical address an attacker would need to know how they map to dimm, banks and rows to optimally row hammer. [5] Seaborn used row hammer itself to derive the complex function that determines physical address to dram location for a sandy bridge computer. [6] Pessl et al. showed how to use “row buffer side channel attacks” to derive the complex addressing function generically and provided maps for many modern intel CPU’s.

To obtain physical addresses the /proc/$PID/pagemap could provide this information. However, /proc/$PID/pagemap, which is not available in all operating systems and no longer affords unprivileged access in most operating systems that do support it. This problem for an attacker remains to be definitively solved.

 

Row hammering with Software

To cause row hammer from software you need to activate memory rows, that is cause reads or writes to physical memory. However modern processors are equipped with caches, so that they don’t incur serious speed penalties when memory is read or written. Thus to cause row hammering bit flips it’s required to bypass the caches.

[1] did this using the clflush instruction that removes a particular address from the cache causing the next read of the address to go directly to memory. This approach has two down sides. First, since clflush is a rare instruction, validator sandboxes (such as NaCL of google chrome) can ban this instruction and thus defeat this attack. Second Just-in-time compilers and existing code on computers generally do not use this opcode disabling attack scenarios where jit compilers are used (such as javascript) or for the future using existing code in data-only attacks.

[7] Aweke posted on a forum he was able to flip bits without clflush – he did not say how, but it was likely using the same method as [8] which systematically accessed memory in a pattern that causes the processor to evict the address of the attacker row from the cache causing the next read to go to physical memory. Unfortunately, how to evict caches is CPU dependent and undocumented and despite [8] Gruss, Maurice & Mangard mapping out how to optimally evict on most modern CPU’s it’s not the most trivial process. Typically, this requires knowledge of the physical address discussed above as well as a complex mapping function for cache sets. It is however possible to approximate this either through using large pages or through timing side channels. Also it is slower and thus less efficient than the clflush version above. Since this vector does not require special instructions, it’s applicable to native code (including sandboxed code), java script and potentially other JIT compiled languages.

[9] Qiao & Seaborn found out that the movnti instruction is capable of by passing the caches on it’s own. Further this instruction is commonly used – including as memcpy/memset in common libraries and thus difficult to ban in validator sandboxes and lowers the burden for future row hammer as a code reuse attack. It remains to be seen if JIT compiled languages can make use of it.

Finally, [10] Fogh showed that the policies that maintains the coherency of multiple caches on the CPU can be used to cause row activations and speculated it would be fast enough for row hammering. Since the coherency policies are subject to much less change than cache eviction policies and does not require any special instructions this method may solve problems with the other methods should it be fast enough. This remain to be researched.

 

Exploiting row hammer

[2] showed that row hammer could be used to break out of the NaCL chrome sandbox. The NaCL sandbox protects itself against by verifying all code paths before execution and block the use of undesired activities. To avoid new unintended code paths to be executed the sandbox enforces a 32 bit aligned address for relative jumps and adding a base address. In code it looks like this:

and rax, ~31a

add rax, r15  //(base address of sandbox)

jmp rax

Bit flips in these instructions often cause other registers to be used and thus bypass the sandbox enforcing limits on relative jumps. By spraying code like the above then row hammer, check for usable bit flips and finally use one of these relative jumps to execute a not validated code path and thus exit the sandbox. Not validated code path can be entered through code reuse style gadgets.

The second and more serious privilege escalation attack demonstrated by [2] was from ring 3 user privileges to ring 0. Since adjacent physical addresses has a tendency to be used at the same time, CPU vendors map adjacent physical addresses to different parts of RAM as this offers the possibility of memory request being handled by DRAM in parallel. This has the effect that banks are often shared between different software across trust boundaries. This allows an attacker to flip bits in page table entries (PTE). PTE’s is central in the security on x86 and controls access writes to memory as well as mapping between virtual and physical addresses.  By repeatedly memory mapping a file many new PTE’s are sprayed and the attack has a good chance of hitting by random using row hammer. The attacker hopes that a bit flips so that a PTE with write privileges maps to a new physical address where another PTE is located. By strategically modifying this second PTE the attacker has read & write access to the entire memory.

[18] Bhattacharya & Mukhopadhyay. Used to Row Hammer to extract a private RSA 1024 bit key. Their attack used a combination of PAGEMAP, a cache side channel attack (prime+probe) and a row buffer side channel attack to find the approximate location in physical memory of the private key. Once located row hammer is used to introduce a fault in the private key, and fault analysis is then used to derive the real private key. This make the attack somewhat unique in that it’s the only attack so far that does not rely on any kind of spraying.

[19] Bosman et. Al. Breaks the Microsoft Edge’s javascript sandbox . First they use two novel dedublication attacks to gain knowledge about the address space layout. This allows them to create a valid looking, but counterfeit java object in a double array. They then find bit flips by allocating a large array and using single sided row hammering. The method used is similar to [8] but they also notice and exploit the fact that pages 128k appart are likely to be cache set congruent. Once they know where the bit flips are they can place a valid object at this address, that is so crafted that the bit flip will change it to a reference to the counterfeit object. Once this is set the row hammering is repeated and they now have a reference for the counterfeit object that can be used by compiled javascript. Further the object can be edited through the double array in which it was created and this allows arbitrary memory read and write.

[20] Xiao et al. The content of this paper is unknown to me, yet the title suggests that cross-VM and a new kind of privilege escalation is possible with row hammer.

It is likely that row hammer can be exploited in other ways too.

 

Row hammer mitigation

Most hardware mitigations suggest focuses on refreshing victim rows thus leaving less time for row hammer to do its work. Unfortunately, during the refresh ram is unable to respond to requests from the CPU and thus it comes at a performance penalty.

The simplest suggestion is increase the refresh rate for all ram. Much hardware support this already for high-temperatures. Usually the refresh rate is doubled, however to perfectly rule out row one would need to increase the refresh rate more than 7 fold [1]. Which in term is a steep performance penalty and a power consumption issue.

TTR [17] is a method that keeps track of used rows and cause targeted refresh of neighbors to minimize the penalty. The method needs to be supported in both CPU as well as RAM modules. MAC also known as maximum activation count keeps tracks of how many times a given row was activated. pTTR does this only statistically and thus cheaper to build into hardware. PARA [1] is another suggested hardware mitigation to statistically refresh victim rows. ARMOR [16] a solution that keep tracks of row activation in the memory interface.

It has been suggested that ECC ram can be used as a mitigation. Unfortunately, ECC ram will not to detect or correct bit flips in all instances where there are multiple bit flips in a row. Thus this leaves room for an attack to be successful even with ECC ram. Also ECC ram may cause the attacked system to reset, turning row hammer into a Denial of Service attack. [4] Suggests this problem persists in real life experiments. Keeping track of ECC errors may however serve as an indication that a system was under attack and could be used to trigger other counter measures.

Nishat Herath and I suggested using the LLC miss performance counter to detect row hammering here [11] Fogh & Nishat. LLC Misses are rare in real usage, but abundant in row hammering scenarios. [12] Gruss et al. ,[13] Payer refined the method respectively with correcting for generally activity in the memory subsystem. The methods are likely to present false positives in some cases and [11] and [13] therefore suggested only slowing down offenders to prevent bit flips. [14] Aweke et al. presented a method that first detected row hammering as above, then verified using PEBS performance monitoring, which has the advantage of delivering an address related to a cache miss and thus grants the ability to selectively read neighbor rows and thus doing targeted row refresh in a software implementation. [15] Fogh speculated that this method could be effectively bypassed by an attacker.

Litterature

[1] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361–372, June 2014.

[2] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015.

[3] Mark Seaborn and Thomas Dullien. “Exploiting the DRAM rowhammer bug to gain kernel privileges”. https://www.blackhat.com/docs/us-15/materials/us-15-Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf

[4] Mark Lanteigne. “How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware

“. Third  I/O. http://www.thirdio.com/rowhammer.pdf

[5] Mark Seaborn.” How physical addresses map to rows and banks in DRAM”;

[6] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, Stefan Mangard: „Reverse Engineering Intel DRAM Addressing and Exploitation“

[7] Zelalem Birhanu Aweke, “Rowhammer without CLFLUSH,” https://groups.google.com/forum/#!topic/rowhammer-discuss/ojgTgLr4q M, May 2015, retrieved on July 16, 2015.

[8] Daniel Gruss, Clémentine Maurice, Stefan Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”

[9] Rui Qiao, Mark Seaborn: “A New Approach for Rowhammer Attacks”.http://seclab.cs.sunysb.edu/seclab/pubs/host16.pdf

[10] Anders Fogh: “Row hammer, java script and MESI”http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html

[11] Anders Fogh, Nishat Herath. “These Are Not Your Grand Daddys CPU Performance Counters”. Black Hat 2015. http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html

[12] Daniel Gruss, Clémentine Maurice, Klaus Wagner, Stefan Mangard: “Flush+Flush: A Fast and Stealthy Cache Attack”

[13] Mathias Payer: “HexPADS: a platform to detect “stealth” attacks“. https://nebelwelt.net/publications/files/16ESSoS.pdf

[14] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin:”ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”

[15] Anders Fogh: “Anvil & Next generation row hammer attacks”. http://dreamsofastone.blogspot.de/2016/03/anvil-next-generation-row-hammer-attacks.html

[16] http://apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer/armor.html

[17]  http://www.jedec.org/standards-documents/results/jesd209-4

[18] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

[19] Erik Bosman, Kaveh Razavi, Herbert Bos, Cristiano Giuffrida “Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector”

[20] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Mircea-Radu Teodorescu:”One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation”.  To be published

[21] Mark Lateigne: “A Tale of Two Hammers. A Brief Rowhammer Analysis of AMD vs. Intel”. http://www.thirdio.com/rowhammera1.pdf

Cache side channel attacks: CPU Design as a security problem

 

End of May I had the opportunity to present my research on cache side channel attacks at the “Hack In The Box” conference. After my presentation with Nishat Herath last year at black hat I published my private comments to that slide deck and that was well received. I had decided to do that again for “Hack In The Box”, unfortunately it took me a little longer to translate my comments into something human readable. But here they are. Since the comments relate directly to a specific slide in the slide deck you’ll probably want to have the slide deck open when reading this blog post. You can find them here: https://conference.hitb.org/hitbsecconf2016ams/materials/D2T1%20-%20Anders%20Fogh%20-%20Cache%20Side%20Channel%20Attacks.pdf

Cache side channel attacks: CPU Design as a security problem

The title of the presentation took quite a while to figure out because I wanted a title that fairly and accurately described the content of the presentation, but one that was also slightly catchy. I like what I came up with.

Here I told how I got into cache side channel attacks as an introduction. I was working on a talk about row hammer detection, when Daniel Gruss tweeted he’d been able to row hammer in Java script. Consequently, I had to figure out how he did it, so that I could speak with confidence. Everything pointed towards the cache side channel attack literature, so I dug in and 4 hours later produced my first cache side channel attack. This side channel attack triggered my row hammer detection and that let me into a year worth of playing with CPU caches.

 

Agenda

When I start doing slides, I always think about what I want to come across to the listener well knowing that most people will have forgotten most two hours later.

1)      I wanted to point out that safe software running on unsafe hardware is unsafe – something I think is too often forgotten. Further I wanted to make the point that we are not talking about a “bug” in the sense that an intel engineer made a mistake. Rather the cache system generally works as designed – though this design has unintended consequences: cache side channel attacks. Specifically:

  1. Attacker can manipulate the cache: Shared L3 (between cores & trust zones) + Inclusive cache hierarchy
  2. The cache state can be queried and give back a lot of valuable information: Timing attacks (on access & clflush), Shared L3, N-Way-Set associative cache, complex addressing function for L3 that application access to sets.

2)      Here I just wanted to point out that this isn’t just an academic issue. In many ways cache side channel attacks, can be a real world alternative to VM breakout and privilege escalation attacks – or even a method to support either of those.

3)      I wanted to make the point that we can do something about these critters – even when the attacker adopts.

Introduction

  • The most important things I wanted to get across was CSCs are an attack methodology and not an attack. Further that the cache design is not a requirement to be an x86, that is the cache could be designed differently and be less useful for cache side channel attacks without causing software incompatibility.
  • The bandwidth of the side channel is essentially the bandwidth of the cache and thus a lot of valuable data can be leaked. The size of the caches makes the retention time of the information high and thus allows us to actually read out the data in most cases (say compare to D’Atoine’s attack on instruction reordering where the information is lost immediately if not queried). Finally, software cannot really avoid using the cache and thus leak information.
  • Here I wanted to point out that intel is unlikely to provide a fix. They won’t replace old processors, micro code update is unlikely and Intel have brought out CPU’s for a while without fixing. Leading us towards software fixes for a hardware problem. This is a situation which is always suboptimal, but short to medium term the only option. It was my intention to show that while suboptimal entirely within the realm of the possible.

Scope

Litterature: ARM: [11], TLB[9], RowBuffer[10]

L3 is not small

Point here was to give a sense of scale on the importance of the L3 cache in the CPU. Also it’s kind of neat you can actually see slices of the L3 cache in a die shot.

Why is this interesting

The point of this slide was to illustrate the flexibility and power of CSC’s. Important to notice here: CSC’s are always attack on software implementation – not algorithms. An RSA implementation might leak it’s private key, another implementation may not.

Literature: Covert channels [13], RSA[4], AES[6], ECDSA[5],Keyboard[7],Mouse[8], KASRL[9]

How the data cache works on Intel + Important cache features

Here I wanted to introduce the design features that allows the attacker to manipulate the cache hierarchy. The second slide is just a summary slide.

 

How memory is stored in L3 + N-Way Set Associative L3

Here I wanted to introduce how the cache stores data. This is further back ground for manipulation, but also serves as a back ground for getting information from the cache. The “intel’s problem” is a so called fully associative cache on bytes. Essentially cache line concept limits accuracy of the attacks and the N-Way set associative cache allows an attacker to work on much smaller units making attacks practical. Further it allows an attacker to deduct information from cache-set congruence, something that is supported by the complex addressing function for the cache. I avoided the complex addressing function deliberately to avoid complexity and just mentioned that “the colors are not clumped together, but spread out”. Cache side channel attacks has historically only approximated this complex addressing function by doing a simple timing attack – loading N+1 addresses from the same set will cause one not to be cached and thus accessing it will be slow. The complex addressing function for cache sets has been reverse engineered resonantly for most intel CPU’s.

 

Example code to attack

The example code I use to explain the attacks is inspired by a function in the GDK library which is significantly more complex than this. I wanted to underline that code as well as data can be attacked and that branches matter. I wanted to speak about the big 3 attacks in the following slides by highlight some differences in a concrete example. The GDK library was first attacked by [7].

Common for all CSC

Here the goal was to introduce timing attack and all the cache manipulation primitives of the big 3 caches. Evict is just accessing enough cache lines congruent to a cache set to cause all previously stored information to be evict. Prime is a variation on Evict – where you carefully access cache lines congruent to a cache set in a way that the attacker knows what is in the cache set. The flush primitive is simply the clflush instruction, that removes a specific cache line from the cache hierarchy.

 

Big 3 Cache side channel Attacks

I wanted to comment that the 3 attacks and the variants of them make up most CSC’s and that they are all fairly similar in structure. Also I wanted to point back to the primitives described in the previous slide.

Literature: Evict+Time[1], Prime+Probe[2], Flush+Reload[3]

“The 4 slides on the attacks”

In these slides I wanted to highlight some advantages and disadvantages for each attack. The Evict+Time does not give any temporal resolution – which I call “post mortem analysis”, but you don’t need to worry about synchronizing with victim. Synchronizing can be done by any means including a separated CSC, a function call or any other information leak. Or even spying constantly. Though the accuracy is cache congruence it should probably be noted that prime and probe is disturbed by any cache set congruent memory operation whereas Evict+Time is only disturbed by those who evict exactly those cache lines used by the function. However, calling a function tends to bring more non-relevant code into play and thus noise. Noise is a more complex topic that the impression the overview slide gives.

Shared memory

I heard the „nobody shares memory“comment one too many times and wanted to raise a flag that it isn’t that uncommon. Finally, I wanted to warn against shared memory – particularly as it’s introduced through deduplication as that’s the most common vector in VM environments. Flush+Reload is just an example of the problems with dedublication, but one can can pile on with D’Antoine’s instruction reordering attack, dedublication attacks, more speculatively row hammer can become a practical cross VM vector with deduplication, row buffer side channel attacks etc. etc.

 

Noise

I wanted to point out that CSC’s are noisy. Usually the noise is due to contention with irrelevant code running on multicore CPU’s or contention in other subsystem than the cache. Also the hardware prefetcher can destroy things for an attacker. Of these effects only the effect of the hw prefetcher cannot be eliminated by repeating the attack – though obviously not all attacks lend themselves to be repeated (you cannot ask a user for his password 10k times). Sometimes you get good results in the first attempt. I had an Evict+Time attack that required more than 10k attempts to give satisfying results.

 

Performance counters

My „agenda“on my blackhat talk last year was to communicate that performance Counters can be an important tool for security related weirdness in the CPU. Part of this slide is an attempt to repeat that message. The important part was to introduce performance counters, performance counter interrupt and setting up the counter as they form an important part of my work on detecting cache side channel attacks.

Flush + Reload code

Here the Clflush is used as a “manipulate cache to a known state primitive”.

The Mov instruction (which could be replaced by most memory access instructions) is the reload phase. The RDTSC(P) does the actual timing of the reload phase and the mfence instructions prevents that the CPU reorder the instructions to do the reload phase outside the rdtsc(p) bracket.

My comments read that I should explain that cache side channel attacks like flush+reload is in a race condition with the process being spied upon.–Say if we’re attacking keyboard input we’ll be less visible if we wait a few milliseconds per iteration because nobody types that fast whereas for crypto we almost always need much higher temporal resolution and usually wouldn’t wait at all.

Detecting flush+reload

My original suggestion was to see how long a given count of cache misses would take. If too fast we had a cache side channel attack. [12] and [13] improved on that. All works fairly well.

Row hammer was flush + reload

Just noting here that if we remove the information acquisition and do it for two addresses at once we have the original row hammer code. It’s then easy to see that row hammer was a flush+reload attack. The “was” word was carefully chosen. Others has shown that the movnti instruction is a vector for row hammer too, and that vector is not a row hammer related attack. To round off my introduction I hope I mentioned that rowhammer.js was a flush+reload variation that I (and others) usually call Evict+Reload using the eviction primitive I discussed in a previous slide.

 

Flush + Flush

The back story here is I’d figured that clflush would leak information about the cache state and when approached by Daniel Gruss and Clementiné Maurice about detecting a cache attack that causes less cache misses I immediately knew, what they were talking about. Instead of competing to publish I did not do more work in this direction. I did get to review their wonderful paper though.

Flush+flush works with replacing the mov instruction in flush+reload with a clflush but is otherwise identical. The clflush instruction is probably faster when the cache line parameter isn’t in the cache because it’s able to shortcut actually flushing the cache.

Flush+flush has an advantage beyond stealthiness: clflush is faster than mov on uncached memory. Also it leaves the cache in a known state which means the first line of code can be skipped when iterating the attack. This attack is probably the fastest CSC. Also the clflush instruction introduces less problems with the hardware prefetchers. Literature: [13]

Why is flush+flush Stealth

Clflush does not cause cache misses! However, the victim still incurs cache misses due to flush and reload. This usually isn’t sufficient for the flush+reload detection I outline in previous slides to get decent detection rates without incurring significant amount of false positives.

 

Detecting flush+flush

The first point is an assumption that we’re not talking about a cross VM attack. My opinion is that cross VM flush+flush attacks should be foiled by not using dedublication. It’s also an assumption that I need. In a cross VM attack the attacker could use ring 0 to attack and thus bypass the CR4.TSD access violation. However, it is worth noting that even in this scenario it would make flush+flush more difficult.

The other 3 points are just the technology I use to catch flush+flush with.

Stage 1

This is actually a revamped version of my “Detecting flush+flush” blog post. After posting that I had some private conversations where my method got criticized for being too costly performance wise. So I first tried to find a better performance counter. There was an event that actually mentioned CLFLUSH in the offcore counters, unfortunately it was only available on a small subset of microarchitectures that I deemed too small to be worthwhile. I actually tried to see if intel just changed documentation and it appears they really revamped that event for other purposes. Then I played around with the CBO events in the cache system, though I could get better than cache-ref it was at the cost of being gameable from an attacker view point. So instead I decided to make a two stage approach – first detect the bracket and then detect clflush instruction. I had two reasons for this approach. The first is to deal with the performance impact of the second stage which can be quite severe. The second is I have a feeling that we’ll see other instruction-timing attacks in the months to come and this two stage model maps well to defending this kind of problem. The reason why this two stage works better is that the RDTSC instruction itself is rare, but in pairs spaced close enough for an attacker to not drown in noise is so rare that I’ve not seen a single benign application causing this in my testing.

Problems?

Using CR4.TSD to make rdtsc privileged affects performance two fold. First it causes a very significant slowdown on the RDTSC instruction with emulating it in an exception handler. However, the RDTSC is rarely used – in particularly rarely used in a fashion where it makes up a significant part of the total execution time and thus I think the performance penalty is acceptable. Much worse is that the RDTSC instruction becomes less accurate which might cause issues. Short of profiling code, I’ve yet to come up with a benign case where such accuracy would be required. I may be wrong though.

The detection obviously hinges strongly around the RDTSC being used as a timer. I’ve yet to come up with a good alternative to rdtsc, but that doesn’t mean it doesn’t exists.

 

Omissions

The Gory details left out is mostly eviction policy and complex addressing function related stuff. Such as finding eviction sets and priming the cache. There are many suggestions for mitigation – not sharing memory being one that’s very good but incomplete as it doesn’t work for evict+time and prime+probe. There are different ways of adding noise or making the timer less accurate – all in my opinion fundamentally flawed as noise or less timer accuracy can be defeated by law of large numbers. The Cache Allocation Technology can be used to protect against CSC’s – here I’m not a fan because it requires software to be reauthored to be protected and that it has limited coverage. Finally, it’s worth mentioning writing branch-free-cache-set-aware code which actually works, but is difficult and we’re unlikely to be able to demand that all affected software does so.

BlackHoodie and what came after

The demand for information security specialists will experience a growth by more than 50% through 2018, thus our industry is in massive need of talented engineers. On the other hand, the field is rather devoid of female engineers, who seem to frequently scare away from the professions in the security sector.

To encourage female engineers to step up to the challenge of reverse engineering, in 2015 Marion Marschalek decided to organize a free reverse engineering workshop for women, dedicatedly inviting females only. The motivation behind this move was to give female engineers the prospect of a comfortable learning environment; to make them feel entitled to take part, rather, than scare them away. Reverse engineering is considered one of the more complex fields of computer science, good learning material is not always freely available, a steep learning curve in the beginning demotivates a lot of students to get their heads around the materia. Thus the idea to host an event which would support one of infosec’s minorities, the ladies.

The workshop took place in September at University of Applied Sciences St. Pölten, Lower Austria. It was hosted on a weekend, in order to ensure a maximum number of participants could take part.

A total of 15 participants fought their way to rural Austria, which is impressive considering the overall scarcity of female engineers at common security events. It is evident that hosting such an event as a female-only workshop encouraged more female participants to join than there would have been at a general training course. The attendees came traveling from several different countries, many of them on their own expenses.

Prior to the workshop the participants had to complete 4 preparational exercises, in order to familiarize themselves with general concepts of malware reverse engineering. The topic itself is not easy to comprehend, more over, learning to reverse engineer binaries on a single weekend is surely not feasible. The preparational tasks were meant to instruct the participants on how to set up their own malware analysis environment, also the basic concepts of static and dynamic analysis were to be internalized. The tasks included several exercises to complete, but also a list of reading material to cover topics such as malware anti-analysis and runtime packers. Finally the participants had to perform first reverse engineering tasks by steering a debugger through a set of minimalistic binaries.

The workshop itself started off with high goals. During the designated weekend the attendees were confronted with a sample of real world malware, protected by a custom runtime packer and a number of anti-analysis measures. The binary itself was a variant of Win32.Upatre and rather compact, generally suitable for beginners, yet showing all the traits that make everyday malware. Still, a single weekend is a rather short timeframe, thus the attendees were required to put in a lot of energy; and concentration, guts and endurance. After a short recap of prior exercises the class started out on analysing and bypassing protection measures of the malware. These included self modifying code, a breakpoint detection trick and execution of code within a window handler function. After the protection layer the malware entered a decompression phase, followed by the import table reconstruction commonly performed by runtime packed binaries.

The final payload of the malware, a function to download further executable content from a remote server, was left as a homework due to timing constraints. Nevertheless, the attendees showed a lot of interest during the training and were an extraordinarily eager class.

Following the principle that easy content just won’t stick, the training weekend was intentionally not designed to be quickly digested. The attendees got a crashcourse on malware reversing, from where they were free to go on on their own to explore further challenges. The primary intention of the workshop was of course to familiarize the participants with binary analysis, furthermore though they were supposed to take away the credo “yes, I can” when looking at complex tasks in the future.

Indeed, now half a year after the workshop the former BlackHoodie attendees have shown marvellous success stories. Two of them have taken on their first reverse engineering positions with Quarkslab in Paris. One did her first malware research talk at Botconf last year, presenting on botnet analysis, and is going for the next speaking engagement soon; one spoke at RootedCon this year about iOS malware attacking non-jailbroken devices. Two ladies decided to pick up RE as topic for their thesis, one focusing on analyzing threat actor TTPs, one on analyzing the NDIS stack relying on memory images. Finally, an eager participant collected her first two CVEs this year by exploiting BMC Logic’s BladeLogic Server Automation product, presenting the findings at Troopers conference. Needless to say, among the participants are seasoned engineers, who excel in cryptography, software development, incident response, and security management every day.

We have high hopes last year’s BlackHoodie attendees keep up the good work and we are looking forward to the next edition, coming up in fall 2016. The next event will be hosted by Marion Marschalek and Katja Hahn in Bochum at the G DATA Campus. It will again be free of charge and hopefully encourage even more female engineers to wreck their brains over binary analysis.

Presenting PeNet: a native .NET library for analyzing PE Headers with PowerShell

screen1The last years have seen a strong rise in PowerShell and .NET malware, so in this article we go native and show how PowerShell and the PeNet library can be leveraged to analyze Portable Executable (PE) headers of Windows executables, motivated by, but not limited to .NET binaries. PeNet is published under Apache License Version 2.0 and maintained by the author. Find an API description for C#, VB, C++ and F# in the PeNet API Reference.

Why PowerShell

PowerShell is an object-oriented programming language with full access to all .NET and Windows libraries. This is one of the reasons why PowerShell became more and more popular in malware and exploitation. The syntax makes it easy to use for everyone familiar with C/C++ or an equivalent modern programming language. Since most malware analysis is done automated to process huge amounts of malicious executables, a scripting language comes handy in this context. As every Windows PC has PowerShell and an IDE for it installed, it is easy to start coding and you can be sure that your code runs basically on every computer running Windows.

About PeNet Library

The PeNet library is a .NET library written in C# which parses PE headers of Windows executables. It can show and interpret values, flags, resources and so on, and is also capable of changing the values. The library is under the Apache 2 license and can be used by everyone in their projects.

Analyzing a PE Header

In the following section we walk through and example to demonstrate how to analyze the PE header of a Windows executable using PeNet. 

First, we need to import the PeNet library in PowerShell:

# Import the PeNet DLL

[System.Reflection.Assembly]::LoadFrom(“C:\PeNet.dll”| Out-Null

Now, we define a path to an executable and a new object which represents the PE header.

# Windows Executable to analyze

$executable ‘C:\Windows\System32\calc.exe’

# Create a new PE header object with PeNet

$peHeader New-Object PeNet.PeFile -ArgumentList $executable

The $peHeader object contains all information we can get from the PeNet library. 

Basic File Information

Now, let us check, if the file we are analyzing is a PE file or not and display some information about the PE file.

Write-Host “Is PE File`t” $peHeader.IsValidPeFile

Write-Host “Is 32 Bit`t”  $peHeader.Is32Bit

Write-Host “Is 64 Bit`t”  $peHeader.Is64Bit

Write-Host “Is DLL`t`t”   $peHeader.IsDLL

Write-Host “File Size`t”  $peHeader.FileSize “Bytes”

 

These lines are accessing flags and properties of the $peHeader object. The output for these lines is:

Is PE File    True

Is 32 Bit     False

Is 64 Bit     True

Is DLL        False

File Size     32768 Bytes

The output shows that the executable we are analyzing is a 64 Bit PE file and an executable but no DLL. The file size is 32768 bytes. To identify files, hashes are common in malware analysis, so let us display the most common hashes.

Write-Host “SHA-1`t”   $peHeader.SHA1

Write-Host “SHA-256`t” $peHeader.SHA256

Write-Host “MD5`t`t”   $peHeader.MD5

Output:

SHA-1         7270d8b19e3b13973ee905f89d02cc2f33a63aa5

SHA-256       700ee0e9c1e9dc18114ae2798847824577e587813c72b413127fb70b9cb042dd

MD5           c4cb4fdf6369dd1342d2666171866ce5

One more interesting hash is the Import-Hash (ImpHash), which is a hash over the imported functions of the PE file. The ImpHash can be used to identify similar malware, since the same ImpHash is a strong indicator for similar behavior, or at least the same packer.

Write-Host “ImpHash`t” $peHeader.ImpHash

Output:

ImpHash 251a86d312a56bf3a543b34a1b34d4cc

The ImpHash is used by VirusTotal, too, and besides the .NET implementation in PeNet, a Python library exists which can be used to compute the hash.

To check if the PE file is signed with a valid signature we can use the following PS code:

Write-Host “Signature Information” -ForegroundColor Yellow

Write-Host “Is Signed`t`t`t”        $peHeader.IsSigned

Write-Host “Is Chain Valid`t`t”     $peHeader.IsValidCertChain($True)

Write-Host “Is Signature Valid`t”  ([PeNet.Utility]::IsSignatureValid($executable))

The parameter $True indicates that PeNet should check the certificate chain online. If we set the parameter to $False, the check would be done offline, which could lead to outdated results.

The output for our example is:

Is Signed                   False

Is Chain Valid              False

Is Signature Valid          False

The executable isn’t signed at all, so obviously the chain and the signature are invalid.

DOS/NT Header

We have two options to get the values of each structure in the PE header. The first is
by value, where you can access every single structure member. The second methods
returns the whole structure with all value containing it as a string for easier representation in console output.

To access single values, let’s have a look how we can obtain the e_magic value from the
IMAGE_DOS_HEADER and how to access the Signature from the IMAGE_NT_HEADERS.

“e_magic`t {0:X0}” -f $peHeader.ImageDosHeader.e_magic

“NT signature {0:X0}” -f $peHeader.ImageNtHeaders.Signature

We start at the PE header and select the structure and value we want. The values are
printed to the console in hexadecimal.

Output:

e_magic        5A4D         

NT signature   4550

The PeNet library overrides a ToString() method for each PE structure. This means
that we can print the whole IMAGE_DOS_HEADER (and any other header) directly to
the console without caring for formatting or without the knowledge of the members of the structure.

Write-Host $peHeader.ImageDosHeader

Output:

IMAGE_DOS_HEADER

e_magic   :   5A4D

e_cblp    :   90

e_cp      :   3

e_crlc    :   0

e_cparhdr :   4

e_minalloc:   0

e_maxalloc:   FFFF

e_ss      :   0

e_sp      :   B8

e_csum    :   0

e_ip      :   0

e_cs      :   0

e_lfarlc  :   40

e_ovno    :   0

e_oemid   :   0

e_oeminfo :   0

e_lfanew  :   F8

File Header

The IMAGE_FILE_HEADER is a structure in the IMAGE_NT_HEADERS and can be accessed again as single values or as a whole.

For example, let’s access the machine type of the file header.

“Machine {0:X0}” -f $peHeader.ImageNtHeaders.FileHeader.Machine

Write-Host “Resolved Machine`t ([PeNet.Utility]::ResolveTargetMachine($peHeader.ImageNtHeaders.FileHeader.Machine))

In the first line, we print the machine field as a hex value. Since this value doesn’t
say much we can use the PeNet library to resolve the value to a meaningful
string as shown in the second line.

Output:

Machine              8664

Resolved Machine     AMD AMD64

Interesting in the file header are the file characteristics which can be accessed and
resolved like this:

“`nCharacteristics {0:X0}” -f $peHeader.ImageNtHeaders.FileHeader.Characteristics

Write-Host ([PeNet.Utility]::ResolveFileCharacteristics($peHeader.ImageNtHeaders.FileHeader.Characteristics))

The first line prints the file characteristics as a hex value and the second line resolves them to different flags. All flags can be access on their own, but we just print them all at once.

Output:

Characteristics 22

 

File Characteristics

RelocStripped                 :     False

ExecutableImage               :      True

LineNumbersStripped           :     False

LocalSymbolsStripped          :     False

AggressiveWsTrim              :     False

LargeAddressAware             :      True

BytesReversedLo               :     False

Machine32Bit                  :     False

DebugStripped                 :     False

RemovableRunFromSwap          :     False

NetRunFroMSwap                :     False

System                        :     False

DLL                           :     False

UpSystemOnly                  :     False

BytesReversedHi               :     False

And again to print the whole header just type:

Write-Host $peHeader.ImageNtHeaders.FileHeader

Output:

IMAGE_FILE_HEADER

Machine:                     8664

NumberOfSections:            6

TimeDateStamp:               5632D8CE

PointerToSymbolTable:        0

NumberOfSymbols:             0

SizeOfOptionalHeader:        F0

Characteristics:             22

Section Header

To access the sections of the executable we parse the section header and resolve the section characteristics to get an overview of the rights of each section.

$num = 1;

foreach($sec in $peHeader.ImageSectionHeaders)

{
    Write-Host “`nNumber:” $num

    Write-Host “Name:” ([PeNet.Utility]::ResolveSectionName($sec.Name))

    Write-Host $sec

    $flags ([PeNet.Utility]::ResolveSectionFlags($sec.Characteristics))

    Write-Host “Flags:” $flags 

    $num++;

}

The output for the first two sections with resolved name and resolved flags:

 

Number: 1

Name:  .text

IMAGE_SECTION_HEADER

PhysicalAddress:             16B0

VirtualSize:                 16B0

VirtualAddress:              1000

SizeOfRawData:               1800

PointerToRawData:            400

PointerToRelocations:        0

PointerToLinenumbers:        0

NumberOfRelocations:         0

NumberOfLinenumbers:         0

Characteristics:             60000020

 

Flags: IMAGE_SCN_CNT_CODE
IMAGE_SCN_MEM_EXECUTE IMAGE_SCN_MEM_READ

 

Number: 2

Name:  .rdata

IMAGE_SECTION_HEADER

PhysicalAddress:            151A

VirtualSize:                151A

VirtualAddress:             3000

SizeOfRawData:              1600

PointerToRawData:           1C00

PointerToRelocations:       0

PointerToLinenumbers:       0

NumberOfRelocations:        0

NumberOfLinenumbers:        0

Characteristics:            40000040

 

Flags:
IMAGE_SCN_CNT_INITIALIZED_DATA IMAGE_SCN_MEM_READ

Optional Header / Data Directory

The Optional Header is, like the File Header, a structure in the NT header and can be accessed by each member or as a whole. In the following we just print the whole header. Since the Optional Header includes the Data Directory structure it is printed (cut off), too.

IMAGE_OPTIONAL_HEADER

Magic:                      20B

MajorLinkerVersion:         C

MinorLinkerVersion:         A

SizeOfCode:                 1800

SizeOfInitializedData:      6A00

SizeOfUninitializedData:    0

AddressOfEntryPoint:        22E0

BaseOfCode:                 1000

BaseOfData:                 0

ImageBase:                  140000000

SectionAlignment:           1000

FileAlignment:              200

MajorOperatingSystemVersion: A

MinorOperatingSystemVersion: 0

MajorImageVersion:          A

MinorImageVersion:          0

MajorSubsystemVersion:      A

MinorSubsystemVersion:      0

Win32VersionValue:          0

SizeOfImage:                D000

SizeOfHeaders:              400

CheckSum:                   FB66

Subsystem:                  2

DllCharacteristics:         C160

SizeOfStackReserve:         80000

SizeOfStackCommit:          2000

SizeOfHeapReserve:          100000

SizeOfHeapCommit:           1000

LoaderFlags:                0

NumberOfRvaAndSizes:        10

 

IMAGE_DATA_DIRECTORY

VirtualAddress:             0

Size:                       0

IMAGE_DATA_DIRECTORY

VirtualAddress:             3D68

Size:                       DC

IMAGE_DATA_DIRECTORY

VirtualAddress:             7000

Size:                       4718

IMAGE_DATA_DIRECTORY

VirtualAddress:             6000

Size:                       180

IMAGE_DATA_DIRECTORY

VirtualAddress:             0

Size:                       0

IMAGE_DATA_DIRECTORY

VirtualAddress:             C000

Size:                       48

IMAGE_DATA_DIRECTORY

VirtualAddress:             3590

Size:                       38

IMAGE_DATA_DIRECTORY

VirtualAddress:             0

Size:                       0

IMAGE_DATA_DIRECTORY…

 

Imports, Exports and Resources

We have seen how the Optional Header and the Data Directory can be accessed. The most interesting parts of the Data Directory are the Imports, Exports and Resources. PeNet parses all these directories for us and shows us the content.

Let’s start with the imports. Since listing all imported functions would be too much for this article we will have a look at all function imported from Kernel32.dll

foreach($import in $peHeader.ImportedFunctions)

{
    if($import.DLL -eq “KERNEL32.DLL”)

    {

        $import

    }

}

This code iterates over all imported functions and checks if the DLL from which the function should be imported is the Kernel32.dll. If so, the imported function is printed.

Output:

Name                        DLL          Hint

—-                        —          —-

GetLastError                KERNEL32.dll  599

CreateEventExW              KERNEL32.dll  179

WaitForSingleObject         KERNEL32.dll 1483

SetEvent                    KERNEL32.dll 1291

FindPackagesByPackageFamily KERNEL32.dll  395

QueryPerformanceCounter     KERNEL32.dll 1081

GetCurrentProcessId         KERNEL32.dll  529

GetCurrentThreadId          KERNEL32.dll  533

CloseHandle                 KERNEL32.dll  124

GetTickCount                KERNEL32.dll  765

RtlCaptureContext           KERNEL32.dll 1210

RtlLookupFunctionEntry      KERNEL32.dll 1217

RtlVirtualUnwind            KERNEL32.dll 1224

UnhandledExceptionFilter    KERNEL32.dll 1441

SetUnhandledExceptionFilter KERNEL32.dll 1377

GetCurrentProcess           KERNEL32.dll  528

TerminateProcess            KERNEL32.dll 1407

RaiseException              KERNEL32.dll 1103

GetSystemTimeAsFileTime     KERNEL32.dll  736

Since we are analyzing an executable and not a DLL there are most likely no exports, but to be sure let us check.

if($peHeader.ExportedFunctions -eq $null)

{
    “Image has no exports.”

}

else

{
    “Image has exports.”

}

Output:

Image has no exports.

Nothing to see here, let’s move on and check for resources. We print the root elements of this executable and no subdirectories to keep the output readable.

 

foreach($de in $peHeader.ImageResourceDirectory.DirectoryEntries)

{
    if($de.IsIdEntry -eq $True)

{

      Write-Host “ID Entry” $de.ID ” Resolved Name: “ ([PeNet.Utility]::ResolveResourceId($de.ID))

      }

    elseif($de.IsNamedEntry -eq $True)
{ 

        Write-Host “Named Entry: “ $de.ResolvedName
    }

}

Output:

ID Entry 3  Resolved Name:   Icon

ID Entry 14  Resolved Name:  GroupIcon

ID Entry 16  Resolved Name:  Version

ID Entry 24  Resolved Name:  Manifest

Again, we used a resolve function to map the IDs of directory entries to meaningful names.

Pattern-Matching

Often, we don’t search for a structure member to have a specific value, but instead want
to match some byte or string signature over samples. That’s why PeNet comes with a build-in Aho-Corasick pattern matching algorithm for byte arrays and strings.

The idea is to construct a trie containing all signatures once and to use this trie to match multiple executables against the signatures. We will see an example below.

$trie New-Object PeNet.PatternMatching.Trie

 

$trie.Add(“MicrosoftCalculator”, ([System.Text.Encoding]::ASCII), “pattern1”)

$trie.Add(“not in the binary”, ([System.Text.Encoding]::ASCII), “pattern2”)

$trie.Add(“<assemblyIdentity”, ([System.Text.Encoding]::ASCII), “pattern3”)

 

$matches = $trie.Find($peHeader.Buff)

 

foreach($match in $matches)

{
    Write-Host “Pattern” $match.Item1 “matched at offset” $match.Item2

}

 

A new trie object is created and then filled with some strings to search for. The encoding is of the string is given, because string in binaries are often in Unicode. The pattern2 is not in the binary and should not be found, the other two patterns should.

The output is:

Pattern pattern1 matched at offset 9410

Pattern pattern3 matched at offset 14498

Pattern pattern3 matched at offset 14725

We see that pattern1 matched once and pattern3 matched two times at different offsets.

Conclusion

We’ve seen how PE headers can be analyzed with PowerShell and PeNet. PeNet is a .NET library, i.e., any other .NET language can be used, too. The library is still under development and a lot of PE features are still missing, but they will follow soon. None the less, it’s a mighty tool for malware analysis, if building large-scale, automated analysis systems with the .NET framework is what you aim for. Feel free to contribute to the PeNet library or fork it at GitHub.