The Kings In Your Castle Part 4 – Packers, Crypters and a Pack of RATs

In part 4 of our series “The Kings In Your Castle”, we’re back with the question: what does sophistication even mean? I’ll outline what complexity means from a malware analyst’s perspective, why malware tries to be undecipherable and why it sometimes doesn’t even try. This blog entry also presents our findings on commodity RATs within the corpus of malware we analyzed, as part of our talk at Troopers conference in March.

If you are interested in previous parts of the series please check them out here, here and here.

What does sophistication even mean?

The complexity of software is a rather soft metric that hasn’t undergone much scrutiny in definition. For a malware analyst this question takes on a whole lot of different shades, as malware by nature aims to hide its intentions. At least most of it does, one would think?

What poses challenges for analysts are techniques such as code obfuscation or the use of well-fortified crypters. What also presents a remarkable challenge is structured application design, although this might sound somewhat counter-intuitive. Multi-component malware with a well-thought-out object-oriented design and highly interdependent components causes more of a headache for an analyst than any crypter out there. In fact, many of the well-known high-profile attack toolsets aren’t protected by a packer at all. The assumption is that, for one, runtime packers potentially catch the unwanted attention of security products; but also, for highly targeted attacks they might not be necessary at all. Malware developers with sufficient dedication are very well able to hide a software’s purpose without the use of a runtime packer. So why do we still see packed malware?

A software crypter is a piece of technology which obfuscates software and its intentions, but also changes its appearance. Crypters and packers are frequently applied to malware in order to ensure the reusability of the actual malcode. That way, when malware is detected once, the same detection will not apply to the same malware running on a different system.

Let’s take a step back though. The ‘perfect targeted attack’ is performed with a toolset dedicated to one target only. We can conclude that a crypter is needed if either the authors aren’t capable of writing undetectable malware or, more likely, if the malware is intended to be reused. This makes sense if you consider that writing malware takes time and money; a custom attack toolset represents an actual (and quite substantial) investment.

Finally, with economic aspects in mind, we can conclude that attacks performed with plain tools are the more expensive ones, while attacks using packers and crypters to protect malware are less resource-intensive.

4_1_sophistication

The actual hypothesis we had started our research with was that most targeted malware comes without a crypter. In more detail, we put forward the statement that targeted malware is significantly less protected than random malicious software. Again, random malware doesn’t know its targets and by definition is intended to infect a large number of individuals, while targeted malware supposedly is designed for a limited number of targets.

Packer Detection (Like PEiD Was Broken)

Now, the usage statistics of runtime packers and crypters would be easy to gather from our dataset if the available state-of-the-art packer detection tools weren’t stuck somewhere in 1997. In practice, the question whether a file is packed or not is not trivially answered. One of the tools frequently used to identify packers is named PEiD. PEiD applies pre-defined signatures to the first code bytes starting from the executable entry point, which easily fails if the code at the packed binary’s entry point changes only slightly.

Aiming for more accurate results, we decided to come up with our own packer evaluation. We put together a set of indicators for abnormal binary structures, which are frequently seen in relation with runtime packers. Please note, the indicators are just that – indicators. The evaluation algorithm leaves some space for discrepancies. In practice though, manual analysis of randomly picked samples has proven the evaluation to be reasonably precise.

We gathered the following PE attributes:

  • Section count smaller than 3
  • Count of TLS sections bigger than 0
  • No imphash value present, thus import section empty or not parseable
  • Entropy value of code section smaller than 6.0 or bigger than 6.7
  • Entry point located in section which is not named ‘.text’, ‘.itext’ or ‘.CODE’
  • Ratio of Windows API calls to file size smaller than 0.1

Of course, no single one of the gathered attributes is a surefire indicator in a packer evaluation process. In reality, some of the mentioned presumed anomalies are frequently seen within unpacked binaries, and depend, for example, on the executable’s compiler. Nevertheless, any of the mentioned features is reason enough to grow suspicion, and as mentioned before, the evaluation works rather reliably on our chosen dataset.

According to our algorithm, the gathered values weigh into an evaluation score; based on the final score an analyst can then draw his conclusion. It is worth noting at this point that the chosen thresholds are quite sensitive, and one would expect to rather detect too many “potentially packed” samples than too few.

Further details about our packer evaluation method can be found within the published code.
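As a rough illustration, such a scoring scheme might look like the following sketch. The attribute names and individual weights here are assumptions chosen for demonstration (picked so the maximum adds up to the 220-point scale mentioned below), not the actual values from our implementation; those live in the published code.

```python
# Illustrative sketch of a packer-indicator scoring scheme.
# Weights and attribute names are assumptions for demonstration only.

def packer_score(info):
    """info: dict of statically extracted PE attributes."""
    score = 0
    if info.get("section_count", 99) < 3:
        score += 40
    if info.get("tls_section_count", 0) > 0:
        score += 40
    if not info.get("imphash"):  # import section empty or not parseable
        score += 40
    entropy = info.get("code_entropy", 6.3)
    if entropy < 6.0 or entropy > 6.7:
        score += 40
    if info.get("entry_section") not in (".text", ".itext", ".CODE"):
        score += 30
    if info.get("api_to_size_ratio", 1.0) < 0.1:
        score += 30
    return score  # maximum of 220 when every indicator fires

suspicious = {
    "section_count": 2, "tls_section_count": 1, "imphash": None,
    "code_entropy": 7.9, "entry_section": ".xyz", "api_to_size_ratio": 0.01,
}
print(packer_score(suspicious))  # 220
```

An analyst would then eyeball samples above a chosen cutoff rather than treat the score as a binary packed/unpacked verdict.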

The results can be found in the following charts, showing the evaluation values in relation to sample counts. The maximum score a sample can reach on our scale is 220, meaning that all evaluation attributes exceed the chosen thresholds. The following graphics show the evaluation performed on a benign sample set, on a targeted malware sample set and on a random malware sample set. Attention should be paid to the sample frequency count on the y-axis of each graph.

benign
The benign set
targeted
The targeted malware set
random
The random malware set

The graphs show very well how roughly a third of the benign samples have low-rated indicators, while in the random malware sample set less than a third shows no indicators at all and more than a third shows remarkably high-rated indicators. It shall be mentioned that a look into the benign samples rated at 40–50 revealed that most of them were packed with UPX, a binary packer used mainly for compression.

The remarkable bit at this point is that the set of targeted malware binaries has the overall lowest count of packer indicators. This leaves us with two possible conclusions. Following our hypothesis that targeted malware is significantly less protected by crypters than random malware, we take this result as confirmation.

On the other hand, what surely biases the result is that the chosen attributes are potentially influenced by the compilers used to compile the given binaries. As the results for the targeted set are notably homogeneous, this would mean though that the larger part of targeted malware within our dataset has probably not seen exotic compilers either. As a research idea for future analysis I’d like to put up the somewhat far-fetched hypothesis that most targeted malware is written in C/C++ and compiled with the Visual Studio compiler. Curious, anyone?

Commodity RATs

Taking the question of malware sophistication further: in the past, the analysis community was frequently astonished by yet another incredibly complex targeted malware campaign. While it is true that certain targets require a high level of attack sophistication, most campaigns require neither components for proprietary systems nor extremely stealthy methods. An interesting case of high-profile attacks being carried out with commodity malware was uncovered last year, when CitizenLab published their report about Packrat. Packrat is a South American threat actor, active for around seven years, performing targeted espionage and disinformation campaigns in various South American countries. The actor is most known for targeting, among others, Alberto Nisman, the late Argentinean prosecutor who raised a case against the Argentinean government in 2015.

Whoever is responsible for said campaigns did have a clear image of whom to target. The actor certainly possessed sufficient personal and financial resources, yet made a conscious decision to not invest in a high-end toolchain. Looking into the malware set used by Packrat, one finds a list of so called commodity RATs, off-the-shelf malware used for remote access and espionage. The list of RATs includes CyberGate, XTremeRAT and AlienSpy; those tools themselves are well-known malware families associated with espionage.

Again, using repackaged commodity RATs is notably cheaper than writing custom malware. The key to remaining stealthy in this case is the use of crypters and packers. In the end though, by not burning resources on a custom toolchain, the attacker can apply his resources otherwise – potentially even on increasing his target count.

Hunting down a RAT pack

Looking at the above facts, one question emerges: how prevalent are pre-built RATs within the analysis set at all? To establish a count for commodity RATs we again relied on the detection names of Microsoft Defender. The anti-virus solution from Microsoft has shown in the past to be rather slow in picking up new detections, while providing quite accurate results once detections are deployed to the engines. Accuracy at this point includes a certain reliability when it comes to naming malware.

For evaluation, we chose to search for the following list of commodity malware:

  • DarkComet (Fynloski)
  • BlackShades (njRat, Bladabindi)
  • Adwind
  • PlugX
  • PoisonIvy (Poison)
  • XTremeRAT (Xtrat)
  • Handpicked RAT binaries

The selected set is what we noted seeing when going through the malware corpus manually; please note though, the list of existing commodity RATs is far longer.

The “lazy king of APTing”, so to say, is PlugX. The commodity RAT popped up in altogether 15 different events listed in the MISP database.

4_2_plugx

The winner in sample numbers was Adwind, a Java-based RAT designed to infect multiple platforms. Adwind itself is malware that has been sold under different names, including Frutas RAT and Unrecom RAT. Security firm Fidelis published nice insights on Adwind under the much appreciated title “RAT in a jar”.

The following graphic shows the total number of RATs and related events found within our data set of 326 events containing 8,927 malware binaries.

4_3_ratpack

In total, we counted that almost a quarter of the inspected events made use of one or another RAT. Looking at the sample set, a ninth of the total set is composed of pre-built RATs. These numbers are rather high, considering that, at least in the minds of analysts, targeted malware is complex and sophisticated.

Still, though, why do we even bother with high numbers of commodity malware? For one, as mentioned before, they help drive down attack cost. Furthermore, they provide functionality that is quite advanced from the start, equipping even unskilled attackers with a Swiss Army knife of malware they couldn’t implement themselves, even if they tried really hard. These do-it-yourself APT tools enable wannabe spies with little budget, growing the number of potential offenders.

Furthermore, off-the-shelf RATs have been seen in use by certain advanced attackers. They could lead to confusion about the actual offender, as they do not allow for attribution on the basis of the binaries at all. In other words, one would not know whether he is being targeted by a script kiddie or a nation-state actor. Currently it remains unclear whether commodity RATs have been used deliberately to conceal attribution, but the assumption is tempting.

The Kings In Your Castle Part 3 – Ssdeep being fuzzy while exploits are being scarce

Welcome back, still on it? This is part 3 of our blog series, if you’re curious about part 1 and 2, please check them out here and here. This time I’m happy to introduce a set of borderline funny findings and tackle one of the hypotheses we put together for Raphael Vinot’s and my talk “The Kings In Your Castle”, presented at this year’s Troopers Conference in Heidelberg. I will discuss our findings regarding exploits present in known targeted attacks, the obstacles we faced during analysis and how we worked our way around. So, sit back, relax, here we go.

Curiosities

As you might be aware, most data sets come with insights, and most show one or another curiosity. Finding curiosities means learning literally unexpected things, which is why researchers jump at them with the passion of a hungry wolf. Thus, let me start the list of our findings with a curiosity.

While performing clustering on ssdeep hashes we found something we dubbed ssdeep collisions, for lack of a better name. Ssdeep is a program for computing context triggered piecewise hashes. These so-called fuzzy hashes, as opposed to cryptographic hashes, do not uniquely identify a given data blob. They are calculated piecewise and are able to identify shared byte sequences between two targets. They are frequently used to ‘describe’ malicious binaries, in order to match similar binaries and eventually find groups of malware samples or even identify malware families.

The nature of piecewise hashes, one would think, implies that the hashes of two binaries cannot be identical if the binaries differ. Hence it is a curious finding that a number of unique samples within our set showed identical ssdeep hashes. Without spending too much time picking at the implementation of the fuzzy hashing algorithm itself, we assume that ssdeep does not consider the entire binary for hashing. We found a case where 5 MB of empty padding were no reason for ssdeep to show any difference in the resulting fuzzy hash.

ssdeep

More than padding: on some occasions ssdeep indeed missed significant differences in the data sections of compared binaries. Given that analysts and corporations use ssdeep in workbenches and production systems, we found it worth a mention that identical fuzzy hashes by no means prove that the compared binaries are identical.

diffs
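For illustration, a deliberately simplified toy shows why identical fuzzy hashes cannot prove identity. Real ssdeep uses context-triggered block boundaries; this sketch uses fixed-size blocks and reduces each block to 6 bits, but the underlying information loss is the same: each block is compressed to a single character, so differing blocks can map to the same character.

```python
import hashlib
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def toy_fuzzy_hash(data, block_size=64):
    # Reduce every block to a single base64-style character (6 bits of MD5),
    # roughly mimicking how piecewise hashes compress each chunk.
    out = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        out.append(ALPHABET[hashlib.md5(block).digest()[0] & 0x3F])
    return "".join(out)

original = bytes(range(256))
# Brute-force a different final block whose 6-bit signature collides.
collision = None
for n in range(100000):
    candidate = original[:192] + n.to_bytes(8, "little") + original[200:]
    if candidate != original and toy_fuzzy_hash(candidate) == toy_fuzzy_hash(original):
        collision = candidate
        break

print(collision != original)                                  # True
print(toy_fuzzy_hash(collision) == toy_fuzzy_hash(original))  # True
```

The real algorithm is far less crude, but the principle carries over: a fuzzy hash is a lossy summary, so hash equality is a similarity statement, not an identity proof.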

We learned another curiosity when randomly staring at the gathered data. It is fascinating how the human eye manages to find patterns, and indeed very instructive before starting to implement queries or planning for the application of machine learning. This way we saw that, for example, compilation timestamps of binaries usually follow loose patterns within an event. A number of events though show outliers when it comes to timestamps; such as a binary compiled in 1996 while others were compiled post-2007, or a binary with a stamp from 2036. Of course, such outliers can have multiple explanations. The ones that come to mind first: the timestamps are falsified, the attackers forgot to falsify all timestamps, the campaign made use of readily compiled (old) tools, or maybe a runtime packer was used which falsifies the timestamps without explicit intention of the attackers.

20136time
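Extracting the compilation timestamp for such comparisons does not require a full PE parser. A minimal stdlib-only sketch, with offsets per the PE/COFF format (e_lfanew at 0x3C, TimeDateStamp 8 bytes after the PE signature), could look like this; a library such as pefile would of course do the same job:

```python
import struct
from datetime import datetime, timezone

def pe_compile_timestamp(data):
    """Read the COFF TimeDateStamp from raw PE file bytes (stdlib only)."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # COFF header: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4)
    (stamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)

# Demonstration on a synthetic 0x60-byte header (not a real binary)
hdr = bytearray(0x60)
hdr[0:2] = b"MZ"
struct.pack_into("<I", hdr, 0x3C, 0x40)        # e_lfanew -> 0x40
hdr[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<I", hdr, 0x48, 1452600000)  # TimeDateStamp
print(pe_compile_timestamp(bytes(hdr)))        # 2016-01-12 12:00:00+00:00
```

Comparing these stamps across all binaries of an event is what makes the 1996 and 2036 outliers mentioned above stand out.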

One conclusion though is close at hand. To freely quote a splendid APT researcher: attackers learn just like we do and improve over time, which implies that they might have made more mistakes in the past. Thus, by analyzing historical data about a campaign or group one might be able to learn unexpected tidbits. Moreover, by looking at things learned by the attacker, such as changes in malware and intrusion techniques, one might gather insights about obstacles the attackers faced in the past. Adoption of a runtime packer or the addition of a stealth module to a given RAT might reveal that the attacker’s tools were at some point detected by security software.

1970time

 

OMFG!! They used e.x.p.l.o.i.t.s.!!1!

Like in real life, humans tend to conclude that a digital attack which caused big damage naturally came with big power. In the world of cyber though, this equation has a lot of unknowns. In fact, the success of an attack and the damage it can cause are influenced by many factors that are independent of an attacker’s capabilities and wealth. While not true in all cases, mostly the possession of one or more 0-days implies some level of resources, or at least explicit know-how combined with engineering time.

This leads to a natural assumption: folks who do APTs involving 0-days must be either rich, or very very dedicated. Or both. Or do they? When Stuxnet finally happened, the general public seemed to believe that APT goes hand in hand with 0-day. As a considerable time span passed by, the understanding started to sink in that targeted attacks can have all sorts of faces, and barely any post-Stuxnet attack looked anything like what we now call “the first cyber weapon”.

Today, analysts seem to have settled on the awareness that Word macros are just as dangerous to organizations as the latest Flash exploit. There is always someone to open the damn document.

Finally, what this leaves us with is a set of uncertainties. How important are exploits in the APT world? How frequently are they used, how common is the use of precious 0-days? This is the fog we meant to shed some light on.

 

Exploit prevalence at a glance

In the mass malware scene, the number of malware strains and infection campaigns that make active use of exploits is rather low, and seemingly even declining; at least since attackers found out that macro downloaders do the job just as well. The attentive reader will notice: Word macros are a lot easier to write and cheaper to get hold of. And back here we are, reducing the cost of an attack to maximize the number of potential targets. It’s all about resources.

But let us get to the numbers. In total, we analyzed 326 events within our final analysis set, of which 54 were labeled to involve one or more exploits. Such labels are usually tags of CVE numbers, added by the initial reporter of an event. About these tags we do know that a present tag is a strong indicator for an actual exploit being involved (given analysts didn’t make things up); the lack of a tag, though, does not prove at all that no exploits were used. As a counter-metric, we utilized detection names of Microsoft Defender, filtering for names containing one or another CVE number. This way we detected a total of 68 events involving exploits.
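The counter-metric can be sketched in a few lines. The detection names below are hypothetical examples modeled on Defender’s naming scheme, not samples from our data set:

```python
import re

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

def events_with_cves(detections):
    """detections: {event_id: [detection names]} -> {event_id: set of CVE ids}."""
    hits = {}
    for event_id, names in detections.items():
        cves = {m.group(0).upper() for name in names for m in CVE_RE.finditer(name)}
        if cves:
            hits[event_id] = cves
    return hits

# Hypothetical detection names in the style of Defender's naming scheme
sample = {
    1: ["Exploit:Win32/CVE-2012-0158.A", "Backdoor:Win32/PlugX.A"],
    2: ["TrojanDownloader:O97M/Donoff"],
}
print(events_with_cves(sample))  # {1: {'CVE-2012-0158'}}
```

Counting the keys of the result per event set gives the 68-event figure mentioned above.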

Juggling numbers, with considerations of potentially missed detections in mind, roughly a fifth of the analyzed events involved the usage of exploits. With all potential data incorrectness in mind, we are very confident to state that it is not a majority of targeted attacks that involves exploits.

 

The APT exploit landscape

Relying on tags that are present in the MISP data set, we went on to evaluate the exploits we indeed did see. The graphic below shows a chart of CVE numbers, sorted first by tag counts, secondly by year. The numbers refer to the number of events that make use of the listed CVEs.

It is worth mentioning that human analysts as well as security software tend to be more reliable in labelling well-known exploits than fresh or even unknown ones. This chart cannot be used to determine which attacks involved 0-day exploits; in fact, none of the data we have at hand can.

cves

What it does show though is how the curve from frequently to non-frequently seen CVEs holds remarkably old vulnerabilities in the top spots. Absolutely killing it is CVE-2012-0158, a vulnerability within an MS Office OCX library that can be triggered through a malicious Word document. The vulnerability has long been fixed by Microsoft, but, perhaps, not all Office installations are updated all that frequently? Who knows.

Furthermore, we can see that only a minority of 7 CVE numbers can be called more or less up to date. Given that our data collection ended in January 2016, vulnerabilities from 2015 are considered fresh (enough). A total of 12 events involved exploits for non-cyber-stoneage vulnerabilities.

Exceptionally interesting is place number three on the list, CVE-2015-5119, sported by a total of five events. This vulnerability has a history, indeed.

 

HackingTeam exploit gone wild

CVE-2015-5119 is a vulnerability in Adobe Flash, which got leaked with the tremendous breach of the Italian offensive security company HackingTeam last year. The vulnerability was reported and fixed soon after the breach, but nevertheless made it into at least one exploit pack and the toolsets of four quite well-known APT groups. According to our data, the group Wekby and a not further identified spearphishing campaign targeting US government institutions adopted the respective exploits right after the breach went public in July 2015.

The most recent spotting of CVE-2015-5119 within our data happened at the beginning of 2016, when it was seen in context with BlackEnergy, the notorious malware that took the Ukrainian power grid offline at the end of 2015.

5119

 

Discussing the results

The numbers in the dark, or, everything we do not know, pose a significant blind spot to our dataset. There are two considerable unknowns: for one, we do not know whether the data is complete; two, we do not know whether the labels we retrieved are correct.

Concerning problem number 1, intentionally incomplete reports aside, it is very well possible that an attack is detected and remediated, but the actual infection vector never recovered. In the case of an exploit being used, given for example that it is not contained in an e-mail or a file, the forensic reconstruction of the entire intrusion path can be challenging. A good example of such a case, and also a very instructive read on the topic, is Phineas Fisher’s write-up of the HackingTeam breach.

Problem number 2, incorrect labeling, stems from the fact that determining a CVE number for an actual exploit involves the careful analysis work of a specialist. In practice, deriving CVE numbers from AV detection names is “cheap” and works reasonably well, but relies on the respective analysts having done a scrupulous job when looking into the sample. Nevertheless, mistakes are quite possible.

As in all given cases, I am happy to receive suggestions on how to improve on both shortcomings. Meanwhile, we present the discussed numbers with a pinch of salt on the side.

 

The Kings In Your Castle Part #2 – Dataset and feature extraction

Welcome back to my series of write-ups for “The Kings In Your Castle – All the lame threats that own you, but will never make you famous”. This series covers a project I presented together with Raphael Vinot from CIRCL Luxembourg at Troopers conference in March. If you missed the start, you can find it here.

O’right lets go 😀

TTPAs – Tools, Techniques, Procedures and Actors

The primary aim of the toolification process was to extract accurate binary features that would help us describe the binaries in relation to the event data stored in MISP. Therefore we took the feature extraction a step further than usual IOC (Indicator of Compromise) creation would.

IOCs are indicators which describe malware used in an attack or attributes of an attack. They are easy and comparably quick to create, and are then distributed and leveraged to scan computers or networks for compromises. They differ from classical, heuristic malware detection, as indicators are not limited to a per-file basis but can also describe domain names of remote servers, names of files created by malware, or IP addresses.

Despite their many advantages though, IOCs trade in rather shallow features. Domain names, file names or strings within binaries can be easily changed in the course of an operation, at the will of the actor. A goal of our research was to extract more granular file features from different domains than the usual IOCs cover; in a sense, more “expensive” features that we considered less volatile than domain names. This way we expected to find links among different events contained in MISP that the usual indicators miss. In a targeted operation, it is considered expensive to change a toolset, rewrite malware or switch infection vectors, which may for example require the purchase of a new exploit. Currently used indicators lack the capability to describe such “expensive metrics”, hence the idea to widen the feature space.

However, extraction of binary features is not at all a trivial task. The technical implementation of feature extraction aside, it lies within the nature of malicious binaries to hide their features as thoroughly as possible; any features, that is. Runtime packing, obfuscation and sandbox evasion are just a few of the many techniques malware authors use to hinder analysis, which in general complicates feature extraction.

The following lists show attributes that were gathered from MISP, as well as those we extracted from the malicious binaries. The attributes are all gained statically, meaning without execution of the binaries. Static analysis alone is frequently faster than dynamic analysis, the tools are more portable, and large-scale analysis is more scalable. Next to that, we worked with the hypothesis that targeted malware frequently comes unpacked and without obfuscation. On the other hand, if an actor decides to rely on runtime packing, it is an interesting question whether he uses a custom packer or a commercial one, and whether samples were individually packed with a dedicated packer. One would think ‘no’, right?

I will go into more detail on the packer-or-no-packer questions in a follow-up blog post. For the time being, I’ll ask you to grant the benefit of the doubt that our test set of binaries supplied us with considerably useful data. Here goes our feature list, fitted with short explanations of the underlying trail of thought.

MISP data

  • Event ID
  • Event description
  • Event Submission date
  • CVE numbers associated with malware of given event
  • Domains associated with malware of given event

The attributes we pulled out of MISP mainly describe the respective events which the binary hashes of our set are linked to. This information is highly valuable, as it puts the malware into context. Frequently, events in MISP are linked to vendor reports, which provide a plentitude of context for an operation. Furthermore, the event submission date roughly indicates the time when the operation was reported. CVE numbers are considered an indicator of whether the operation involved exploits. This is a rather soft metric; the lack of an entry for a CVE number does not at all prove that no exploits were used by a given malicious actor. Nonetheless, listed CVE numbers are valuable information.
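As a sketch, pulling these fields out of a MISP event JSON export might look as follows. The synthetic event below only mimics MISP’s export structure (an `Event` object with `info`, `date`, `Tag` and `Attribute` entries); in real data, CVE numbers may appear as tags or as attributes depending on the submitter, so a robust extractor would check both:

```python
def summarize_event(event_json):
    """Pull the MISP attributes listed above out of an event JSON export."""
    ev = event_json["Event"]
    attrs = ev.get("Attribute", [])
    tags = [t["name"] for t in ev.get("Tag", [])]
    return {
        "event_id": ev["id"],
        "description": ev["info"],
        "submission_date": ev["date"],
        "cves": [t for t in tags if t.upper().startswith("CVE-")],
        "domains": [a["value"] for a in attrs if a["type"] == "domain"],
    }

# Synthetic, minimal event in the shape of MISP's JSON export
event = {"Event": {
    "id": "1234", "info": "Example spearphishing campaign", "date": "2015-11-02",
    "Tag": [{"name": "CVE-2012-0158"}],
    "Attribute": [{"type": "domain", "value": "evil.example.com"},
                  {"type": "md5", "value": "d41d8cd98f00b204e9800998ecf8427e"}],
}}
print(summarize_event(event)["domains"])  # ['evil.example.com']
```

In practice one would use PyMISP to fetch events rather than parsing raw exports by hand.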

Sample data

  • MD5
  • SHA1
  • Associated Event ID
  • Filetype
  • Filesize
  • Ssdeep hash
  • Detection name of Microsoft Defender

Sample data is a set of descriptive features associated with any filetype. In the course of our research we put our main focus on Windows executable files, as these pose the biggest group among the analyzed sample set. Our decision to add detection names from the Microsoft Defender anti-virus solution is based on Microsoft’s accuracy in assigning names to malware. This knowledge we draw from sheer experience, although empirical tests have shown excellent results in the past.

An interesting attribute in this set is the ssdeep hash of a file. Ssdeep is an algorithm which calculates a fuzzy hash of a given data blob. Fuzzy hashes do not uniquely identify the base data, but are calculated piecewise. This way ssdeep makes it possible to determine similarities among files, and even draw conclusions about the degree of difference between two given files. For more information about ssdeep and fuzzy hashes please visit the SourceForge website. A drawback of fuzzy hashing is that the computing load required for comparing similarities among binaries increases considerably with the number of binaries, as comparisons are pairwise.
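A minimal, stdlib-only sketch of these per-sample features is shown below. File-type detection is reduced to a magic-byte check for brevity; the ssdeep hash and the Defender detection name are omitted since they come from the external ssdeep library and AV scan results, respectively:

```python
import hashlib

def sample_features(data, event_id):
    """Descriptive features for one sample (raw bytes) linked to a MISP event."""
    if data[:2] == b"MZ":
        filetype = "PE"
    elif data[:4] == b"\x7fELF":
        filetype = "ELF"
    else:
        filetype = "unknown"
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "event_id": event_id,
        "filetype": filetype,
        "filesize": len(data),
        # ssdeep hash and Defender detection name would be filled in
        # via the ssdeep library and AV scan results respectively
    }

f = sample_features(b"MZ" + b"\x00" * 62, event_id=1234)
print(f["filetype"], f["filesize"])  # PE 64
```

A real pipeline would of course use a proper file-type library rather than two magic values.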

Executable data

  • Compilation time stamp
  • Imphash value
  • Entry point address
  • Section count
  • Original filename
  • Section names for sections 1-6
  • Section sizes for sections 1-6
  • Section entropies for sections 1-6
  • Number of TLS sections
  • Entry point section
  • Count of calls to Windows APIs
  • Ratio of API call count to filesize

Finally, for the subset of Windows executable files we collected a wealth of descriptive indicators, which apply to meta-information and the internal structure of compiled binaries. Compilation time stamps of binaries can easily be changed, that is true, therefore they have to be taken with a pinch of salt. Singular cases though have shown that looking at a campaign over a period of time, following related events on MISP, sometimes yields unexpected information “leaks”. That is, actors might follow the habit of falsifying timestamps, but erring is human, and sometimes we encounter binaries with seemingly real time stamps. That said, it is obviously of interest to find attacks related to one specific incident, as historical data can reveal unknown traits of a specific actor.

A number of PE attributes serve the detection of packed binaries. The count of PE sections; section names, sizes and entropies; the count of TLS (Thread Local Storage) sections; and the section of the entry point, for example, are considered indicators of whether a runtime packer protects the executable. This is interesting information by itself, as it allows conclusions about which actors use packed malware, which don’t, and which packing habits the different actors have.
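The entropy attribute here is plain byte-level Shannon entropy computed over a section’s raw bytes; a short sketch:

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Byte-level Shannon entropy: 0.0 (constant data) up to 8.0 (uniform)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = sum over byte values of p * log2(1/p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy(b"\x00" * 4096))          # 0.0
print(shannon_entropy(bytes(range(256)) * 16))  # 8.0
```

Compressed or encrypted section data tends toward the high end of the scale, which is why entropy outside the 6.0–6.7 band served as a packer indicator in the evaluation above.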

Next to that, the attributes also serve to determine the similarity among binaries. While on unpacked binaries the attributes are highly dependent on the compiler used to compile the binary, on packed executables the same data reflects similarities between the various packers.

Two rather uncommon binary metrics we came up with are the total count of calls to Windows APIs within the code and the ratio of API call count to file size. The primary consideration here is that packed or protected executables show little interaction with the Win32 API. Another appeal of these metrics is that the API calls within a binary relate to the actual binary structure. While code changes or additions within a piece of malware very likely change the fuzzy hashes, the imphash, the file size and the section attributes, changes to the API call scheme should remain comparatively small.
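How the API call count is obtained statically is implementation-specific; as a hedged illustration, one crude stand-in is counting the x86 opcode sequence for indirect calls through the import table. This is an assumption-laden approximation, not our actual implementation, and byte scanning yields false positives, so treat it as a rough signal only:

```python
def api_call_ratio(data):
    """Crude approximation of the API-call-to-file-size ratio.

    0xFF 0x15 encodes `call dword ptr [addr]` on 32-bit x86, the typical
    way compiled code invokes an imported Windows API through the IAT.
    """
    calls = data.count(b"\xff\x15")
    return calls / len(data) if data else 0.0

# A packed stub barely touches the API: ratio far below the 0.1 threshold
packed_like = b"\x90" * 4096 + b"\xff\x15AAAA"
print(api_call_ratio(packed_like) < 0.1)  # True
```

A more faithful implementation would disassemble the code section and resolve call targets against the import address table.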

Data about the data

The beauty of the data collection process is that it left us with a set of malicious binaries that are somewhat guaranteed to have been part of a targeted attack at some point in the timeline of their use. Furthermore, with the help of MISP we are able to put the binaries into context, as in many cases we know which campaign they were involved with.

For picking events from MISP we applied a loose set of criteria. MISP’s events are pre-classified by the submitter, where a threat level of medium or high indicates that the event is likely of a targeted nature. From this group we handpicked events where the description suggested a known or at least identified threat actor, or where the nature of the malware clearly suggested a targeted background, as e.g. themed spear phishing would.

The initial data collection started in November 2015, so the total set of events only includes cases submitted to MISP until the middle of December 2015. However, in follow-up work some of the feature correlation procedures have been adopted by MISP itself. For more details please refer to the MISP website.

Please note, this procedure leaves quite some room for incorrectness. We assume by default that indicators reported by vendors and their context are correctly associated. This is not always the case, as we found out while performing tests; on some rare occasions data in vendor reports has turned out to be incorrect. As of now we do not have insight into which percentage of reports shows errors. Furthermore, the events contained in MISP only show information that actually is reported, meaning that attacks which by the time of analysis had yet to be discovered, as well as attributes which are potentially missing from reports, pose a blind spot to our research.

Finally, we started off with a set of 501 events which we assumed to contain information about targeted attacks, comprising a total of 15,347 malware samples. From this set we removed a subset of events which we determined to overlap with another event of the same attacker or incident. Duplicate entries happen, as MISP strives not for accuracy but for completeness. The cleanup left us with a set of 326 events and 8,927 samples.
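The de-duplication step can be sketched roughly as follows. The overlap rule used here (drop any event that shares a sample hash with an already-kept event) is an assumption for illustration; the actual clean-up described above was done by manual determination.

```python
# Minimal sketch of event de-duplication by overlapping sample hashes.
# events maps event id -> set of sample hashes; the first event wins.

def deduplicate_events(events: dict[str, set[str]]) -> dict[str, set[str]]:
    kept: dict[str, set[str]] = {}
    seen_hashes: set[str] = set()
    for event_id, hashes in events.items():
        if hashes & seen_hashes:      # overlaps an already-kept event
            continue
        kept[event_id] = hashes
        seen_hashes |= hashes
    return kept

events = {
    "evt-1": {"aaa", "bbb"},
    "evt-2": {"bbb", "ccc"},   # shares "bbb" with evt-1 -> dropped
    "evt-3": {"ddd"},
}
print(list(deduplicate_events(events)))  # ['evt-1', 'evt-3']
```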

[Figure: file types of the entire sample set]

The graphic shows the file types of the entire sample set. Win32 PE executables are clearly dominant. This is explained by the heavy use of repackaged commodity malware by some actors, and does not represent the general distribution of file types per event. Nevertheless, PE32 is the most important file type within the analyzed sample set, accounting for more than 11,000 of the total corpus of 15,347 samples.

In the next blogpost I’ll be introducing our results of an exploit-per-APT analysis, and write about one or another curiosity we found within our final data set.

 

Micro architecture attacks on KASLR

 

Introduction

Recently a number of different micro architecture attacks on Kernel Address Space Layout Randomization (KASLR) have been published. The idea behind KASLR is that code reuse attacks and read/write primitives are useless if the attacker is unable to tell where what code/data is. Consequently, modern operating systems randomize the virtual addresses where code/data is stored. However, it turns out the CPU leaks information about the address space layout, causing the defense to fail. This blog post shortly summarizes the four known micro architecture attacks on KASLR and discusses potential mitigations. Part of the mitigation portion will be speculative. Bias up front: I’m a coauthor of the prefetch attack and quite like it.

What do I mean by a micro architecture KASLR break?

When I wrote my first blog post [8] on breaking KASLR with micro architecture, I got criticized for using the term “micro architecture”. What I mean to convey when I use this term is that there is nothing in the Instruction Set Architecture (ISA) that specifies that the x86 has to work in ways that allows us to attack KASLR. Rather the features that are being used, are part of the implementation of the ISA. This usage of the term seems in line with common usage in the academic literature. There have been attacks on KASLR using page deduplication e.g. Barresi et al. [1] which some consider a microarchitecture attack. I however, do not and will thus not include these attacks in this discussion.

History of micro architecture KASLR breaks

 

Using page faults

Hund, Willems & Holz [9] were the first to publish an attack on KASLR. They use a combination of a classical prime+probe cache attack and a novel attack on the paging caches: the time it takes for a fault to be thrown on unprivileged access to a page on x86-64 differs significantly depending on whether the translation is cached in the paging caches. An attacker can measure this difference. Further, even an unprivileged access to a page causes the translation caches to be loaded, and consequently a timing difference attack can tell whether memory is mapped at a certain location. They called this the “double page fault” attack.

Using Prefetch

I wrote a blog post in January [8] that became my starting point for the paper by Gruss et al. [6]. The Hund, Willems & Holz [9] attack suffers from the drawback that the timing is measured across the page fault handler, which adds significant noise. This makes the attack itself slow, and due to the noise the attack needs to be repeated a significant number of times to become reliable. What I and my coauthors noted was that the prefetch instructions do not cause page faults – they terminate silently. They do, however, have the same timing properties, as they do the same look-ups. Consequently, replacing page faults with the prefetch instruction provides a faster and more reliable KASLR break.

 

Using TSX

Wojtczuk wrote in a blog post [3] that Intel’s TSX technology might make Hund, Willems & Holz’s [9] attack significantly more practical. Jang et al. [2] took this up and showed Wojtczuk [3] was right. TSX works by being able to roll back a set of memory accesses if they cannot be completed immediately due to conflicting accesses in the relevant portions of the memory subsystem – much like transactional access to databases. However, access to a page that would normally trigger a page fault now, inside a transaction, just causes a rollback. Consequently, a significant speed-up and reliability gain can be had by wrapping your “probe” in a TSX section.

 

Using Branch Target Buffer

The fourth and fairly different attack came using the Branch Target Buffer, BTB. Evtyushkin, Ponomarev & Abu-Ghazaleh [5] found out that Intel caches branch targets in the BTB in order to improve branch prediction. Correctly predicting branches is important to performance, because a failure to predict correctly will underutilize the entire instruction pipeline. Intel however is cheap with the bits and uses only the lower 31 bits to store a branch target in the cache. Since KASLR is typically built on entropy in the lower 31 bits, and these bits are shared between kernel and user mode, an attacker can cause conflicts. These conflicts result in measurable timing differences.

 

Features

Despite the first three methods revolving around the same theme, they have significantly different advantages and caveats. The unique BTB method obviously also has diverging properties. I shall stress the advantages and problems of each vector using the Hund, Willems & Holz paper [9] as a baseline. This classical method now appears mostly obsolete. That said, I consider the paper a milestone for many reasons and it’s well worth a read.

The Advantages of the prefetch vector

Using the prefetch instruction is, as noted above, faster and more reliable. Not only is it more reliable, but also more accurate. Without the noise, one can spot a timing difference depending on the level in the page table hierarchy at which an abort was caused. This allows an attacker to efficiently map out the entire address space top down, starting with 1 gigabyte regions. Further, the prefetch instruction actually ignores the page table security settings and prefetches an address into a cache. This allows an attacker to recover physical addresses and bypass SMEP & SMAP when the operating system is using a direct mapping. The prefetch instructions are available on all current CPUs, as they were introduced back in 1998.

The TSX vector’s advantages

Like the prefetch attack, no fault handler is called, and consequently the attack has significantly reduced noise. The TSX attack, unlike the prefetch, can trigger code access through execution using a jump or call instruction. This has significantly different timing properties and consequently allows the attacker to distinguish between code and data. As a downside, it cannot use the cache-loading properties of the prefetch instruction. A major drawback of the TSX vector is the installed base of TSX, which is somewhat limited. It was officially introduced with the Haswell micro architecture. To this day TSX is only available in high-end computers: some i5, i7 and Xeon processors. This does not yet include the U series, which is the most popular processor line for notebooks.

BTB

The BTB attack is probably the most limited attack, as it is only able to locate code and data where branches give away the address. For all its limits in this regard, it is probably the fastest of the attacks. Further, Felix Wilhelm [11] showed that since the attack does not rely on page tables, it translates to host-KASLR breaks too.

Software Mitigation

Intel has in the past not been prone to do anything about micro architectural attacks, and consequently I have significant doubt that they will with the KASLR attacks. In fact, the segment limits that were part of the x86-32 architecture would be a show stopper for the first three attack vectors, but were removed for performance with x86-64.

It is often suggested that making the timer (RDTSC(P)) less accurate will mitigate micro architectural attacks, which covers all four attack vectors here. The cost is obviously some implementation overhead and a less accurate timer for benign purposes. It could be implemented through the CR4.TSD bit for bare metal operating systems, or by setting up the MSRs to get a VMEXIT in hypervisor environments. These costs seem reasonable to me – I’ve yet to hear a convincing argument why user mode applications in general need such timing accuracy, and I never heard a convincing argument that any application would be significantly slowed down by the overhead required to get less accuracy on the RDTSC(P) instruction. That said, a lower timing accuracy is not a good mitigation in cases where an attack can be repeated. The reason is that once an attacker can repeat a timing attack, he can use the law of large numbers to cancel out any noise or bias added to the timer. For KASLR attacks there is little reason why an attacker cannot repeat the attack, and in fact the Hund, Willems & Holz [9] attack does this to great effect – not because of a noisy timer, but because of the noise added by the page fault handler. I thus think it’s safe to conclude that a noisy timer cannot be a solution to KASLR micro architecture attacks.
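The law-of-large-numbers point above can be demonstrated with a few lines of simulation. The latency values and the noise range are made up for illustration; the point is that averaging repeated measurements recovers a signal far smaller than the per-sample jitter.

```python
# Why a noisy timer fails against repeatable attacks: averaging many
# measurements cancels the added noise.
import random

random.seed(1)

MAPPED, UNMAPPED = 100, 103          # "true" cycle counts, 3 cycles apart

def noisy_measure(true_cycles: int) -> float:
    # A deliberately degraded timer: uniform jitter of +/- 50 cycles
    return true_cycles + random.uniform(-50, 50)

N = 200_000
avg_mapped = sum(noisy_measure(MAPPED) for _ in range(N)) / N
avg_unmapped = sum(noisy_measure(UNMAPPED) for _ in range(N)) / N

# Per-sample noise dwarfs the signal, yet the averaged difference
# converges on the true 3-cycle gap.
print(round(avg_unmapped - avg_mapped, 1))
```

The standard error of each mean shrinks with the square root of the sample count, so the attacker simply buys back accuracy with repetitions.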

The Hund, Willems & Holz [9] attack is fairly easy to mitigate. Since they actually trigger a page fault, one has a nice callback in the page fault handler. A few lines of code can use the client CS and error code on the stack and the linear address in CR2 to check whether a user mode program tried to access kernel memory, and respond accordingly. As this event should be really rare, the response can be anything from adding a large delay, making the attack impractical, to shutting down the application.

 

The Jang et al. [2] version of the TSX attack, DrK, uses the page mapping (number of continuously mapped pages) and page attributes to identify modules. This can easily be thwarted by mapping unused pages in the driver region to a single PFN containing only null data. The cost shouldn’t be too bad, as only one PFN needs to be mapped from a number of PTEs, which to some extent can use existing PDEs. Exactly how many additional PDEs would be needed to thwart DrK would be a tuning parameter. This method would also slow down the prefetch attack; however, it currently uses a function call to load the caches and thus cannot be completely thwarted in this fashion. It is likely that the TSX vector could be made to work in a similar fashion.

Both the TSX vector and the BTB vector are vulnerable to performance counter detection, as there are currently performance counters that map to micro events caused by the attacks. For instance, the detection I suggested for the Flush+Flush [7] attack could easily be modified to detect both of them. The prefetch attack, however, does not seem to have a performance counter event that correlates sufficiently with the attack. The performance counters on software prefetches were discontinued and remapped with the by now very old Silvermont generation of processors. These kinds of detections are unfortunately either high-overhead or false-positive prone. Consequently they don’t offer themselves as general solutions, but rather are of a stop-gap nature. Without a good performance counter mapping to the prefetch instruction, reliable detection would likely be prohibitively expensive performance-wise using these methods.

The BTB and TSX vectors offer other ways of detection. In particular, it seems likely that both will generate activity in the Last Branch Record (LBR) stack. If one were to instrument the RDTSC(P) instruction as suggested above, the distance between the last RDTSC(P) instruction and the current one, combined with the LBR, would provide clues to nefarious activity – quite possibly at a fairly low cost.

More promising is a BTB-only trick that relies on the very thing that makes the BTB attack work in the first place. The BTB vector works because only the lower bits are used to store the branch target, and thus you get conflicts between kernel and user mode branches. Interestingly, this conflict consequently only gives you information on the lower 31 bits, and Evtyushkin, Ponomarev & Abu-Ghazaleh [5] note that using the upper bits for KASLR thus remains a solution. Using all the remaining addressing bits for KASLR seems a good idea; however, all modern operating systems sort different types of code and data into blocks and partially use these blocks to identify malignant behavior. For instance, drivers in Windows are loaded between 0xFFFFF800’00000000 and 0xFFFFF8FF’FFFFFFFF. Thus, to minimize changes to this general structure, using only the upper 9 bits (the PML4 bits) would probably be the easiest solution. This would give you 256 different locations which the BTB method could not tell apart (the topmost bit is used to distinguish between kernel mode and user mode). Not much, but an attacker is unlikely to get away with crashing a machine 100 times to get it right by chance. In fact, Windows 10 Anniversary Edition does this for PTEs, but not for code. See Economou [10] for more info.
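A back-of-envelope calculation makes the "not much, but enough" point concrete. The numbers follow directly from the paragraph above; the guess-probability formula is standard combinatorics, not something from the cited papers.

```python
# 9 PML4 bits, minus the top bit that separates kernel from user mode,
# leave 8 bits of entropy the BTB attack cannot see.
PML4_BITS = 9
KERNEL_USER_BIT = 1

slots = 2 ** (PML4_BITS - KERNEL_USER_BIT)
print(slots)                          # 256 candidate locations

def hit_probability(k: int, slots: int = slots) -> float:
    """Chance that k blind guesses hit the right slot at least once."""
    return 1 - ((slots - 1) / slots) ** k

print(round(hit_probability(100), 2))   # ~0.32 after 100 attempts
```

So even 100 crash-inducing guesses succeed only about a third of the time, which is loud enough that the attacker is unlikely to get away with it.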

The three vectors based on actually accessing kernel memory can be mitigated through address space separation. By treating kernel mode as a separate “process” and swapping the CR3 register immediately after/before a ring change, one can effectively block these attacks. Depending on the amount of ring switching going on, an operating system is looking at about 5% overhead, as swapping the CR3 register is expensive in and of itself and because it invalidates the paging caches. Both Jang et al. [2] and Gruss et al. [6] mention this method.

A mitigation that I considered for thwarting the BTB attack was flushing the BTB itself, by systematically executing a number of jmp instructions when switching back into user mode. This method is likely to have too much performance overhead to be practical. The reason is the size of the BTB, which Godbolt [4] estimates at 2048 entries for his Haswell – thus requiring 2048 dummy branch instructions on every transition into user mode.

Conclusion

The natural question, which of these methods is the best, is a misunderstood question. While the first method appears obsolete by now, the other three methods have different advantages and disadvantages. They base themselves on at least two, but probably at least three, different side channels in the CPU, and thwarting all four attacks seems like a difficult undertaking. Personally, I find it likely that there are more leaks in the processor to be found – maybe I’ll write about that another time. The natural conclusion is that KASLR for bare metal operating systems seems to be broken as a defense against privilege escalation mounted from native code. The caveat here is the “from native code” part. Classical data-based attacks, such as the font attacks of the past, cannot be doctored to break KASLR using these micro architectural attacks. Before I cry any crocodile tears, I should note that KASLR was always a stop-gap solution, and consequently mitigations become mitigations for a mitigation. The real problem is bugs in kernel mode – though it is very difficult to fix all kernel mode bugs, minimizing code in the kernel and improving kernel mode code quality must remain the focus of making the kernel secure.

 

Literature

[1] A Barresi, K Razavi, M Payer, T Gross: “CAIN: Silently Breaking ASLR in the Cloud” – https://nebelwelt.net/publications/files/15WOOT.pdf

[2] Y Jang, S Lee, T Kim. “Breaking Kernel Address Space Layout Randomization with Intel TSX.” Proceedings of the 23rd ACM Conference on Computer and Communications Security 2016

[3] R Wojtczuk. “TSX Improves Timing Attacks Against KASLR.” https://labs.bromium.com/2014/10/

[4] M Godbolt. “The BTB in contemporary Intel chips.” http://xania.org/201602/bpu-part-three

[5] D Evtyushkin, D Ponomarev, N Abu-Ghazaleh. “Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR.” In Proceedings of the 49th International Symposium on Microarchitecture (MICRO) 2016

[6] D Gruss, C Maurice, A Fogh, M Lipp, S Mangard . “Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR.” Proceedings of the 23rd ACM Conference on Computer and Communications security 2016

[7] A Fogh. “Detecting stealth mode cache attacks: Flush+Flush.” http://dreamsofastone.blogspot.de/2015/11/detecting-stealth-mode-cache-attacks.html

[8] A Fogh: “Breaking KASRL with micro architecture Part 1”, http://dreamsofastone.blogspot.de/2016/02/breaking-kasrl-with-micro-architecture.html

[9] R Hund, C Willems, T Holz. “Practical timing side channel attacks against kernel space ASLR.” Security and Privacy (SP), 2013 IEEE Symposium on, 191-205

[10] N Economou. “Getting Physical: Extreme abuse of Intel based Paging Systems” – https://blog.coresecurity.com/2016/08/25/getting-physical-extreme-abuse-of-intel-based-paging-systems-part-3-windows-hals-heap/

[11]  F Wilhelm “mario_baslr” https://github.com/felixwilhelm/mario_baslr/

The Kings In Your Castle Part #1

All the lame threats that own you, but will never make you famous.

In March 2016 I presented, together with Raphael Vinot, at this year’s Troopers conference in Heidelberg. The talk covered research on targeted malware: the how’s and if’s of malicious binaries, the role of sophistication and exploits, the presence or absence of patterns within advanced persistent threats (APTs). The write-up of all happenings has taken its time and, as expected, kept growing longer and longer and.. finally we figured splitting the outcome into multiple posts would make all our lives significantly easier.

Five compact blogposts are the result, easy to digest and understand, while covering the following points:

  1. Introduction, hypotheses and analysis process with respective toolset
  2. Description of the data set and the feature extraction process
  3. The findings’ curiosities and results of exploit-per-APT analysis
  4. The use of packers, crypters and commodity RATs within the sample set
  5. Actor correlations and APT naming wars, future research

Here we go, please enjoy part 1 about the kings, your castle, and cyber. At this point I would like to thank Raphael Vinot for joining in on this data shoveling excursion, the team of CIRCL for busily feeding and maintaining their data dragon, and Morgan Marquis-Boire as well as the many other wizards from the research community who kept sending us malicious software.

Part 1: The kings, your castle, and cyber

It is the same question being directed at audiences around the security conference scene: how many people in the room can tell their machine or network is currently not compromised? No hand has been seen to rise in answer. APT was the fashion five years ago and still rocks the most-feared charts on every cyber threat survey. While the tabloid press is generally after the latest most-sophisticated-threat, the analyst community has long resorted to talking about threats that are advanced and persistent.. enough. In terms of sophistication, targeted attacks show all shades of grey; on average though they tend to be rather shallow. On the other hand, security products all have a single weak spot in common: they will always rely on patterns – whether patterns that are there, like signatures, or patterns that are not there, like anomalies. This enables attackers to evade detection with shallow but unknown tools which manage to fly under the radar.

In research we conducted in cooperation with CIRCL, the Computer Incident Response Center Luxembourg, we took on the APT myths by formulating hypotheses based on a set of APTs documented in the MISP platform. MISP stands for Malware Information Sharing Platform and is used by hundreds of organizations to share data on incidents, among them a large number of targeted attacks. It is possible to split the content of the information shared between reports of vendors and events seen by the users of the platform. MISP is maintained and developed by the fine people at CIRCL; data is constantly added by CIRCL members, security companies or independent contributors.

Having this information in one single place allows correlating supposedly new threats reported by vendors with existing events, seen in the wild now or in the past. At the time of conducting the research, MISP held information about more than 2,000 events.

The data contained helps understand the overall nature of the threats, the tools of trade, the preferred approaches of the attackers, and their evolution. It potentially even allows for actor tracking as the correlation of attributes reveals hidden treasures.

The gathered events from MISP are pre-classified by threat level. We concentrated on targeted threats and conducted a survey on the nature of the malware and infrastructure used therein. How much of the analyzed malware is custom made, how much off-the-shelf or simply installs known RATs? How much of it is packed or crypted? Does the fact that malware is not crypted allow conclusions on whether it is used for targeted attacks? How often are exploits used in attacks? Does the use of exploits imply more sophisticated tools, as the attacker is expected to dispose of greater resources?

On a more abstract level, we also wanted to know if we were able to uncover links among actors, their tools and infrastructure, solely based on OSINT data (open source intelligence).

The reason why this is possible lies in the nature of targeted attacks in general. A targeted attack in reality is not a piece of malware. It is a process, consisting of several phases, and many times even a repetitive cycle, if one finds oneself targeted by a determined attacker.


Figure 1 – The APT process

Furthermore, the stages involving malicious software frequently imply the use of different pieces of malware, and also a certain infrastructure to operate attacks. This means, the attackers require servers, domains and maybe even e-mail addresses or fake social media accounts and fake web sites; to orchestrate the malware, store exfiltrated data and to drive hands-on attacks. All of these items come with a certain cost, in time and money, and believe it or not, most attackers have restrictions – in time, and money.

Limited resources call for repetition, or, in other words, attackers with a hint of economical thinking won’t reinvent the wheel anew every time a new target is chosen. Advanced attack suites, exploits, tailored malware, techniques and practices are assets that are costly to change.

This said, reinventing the wheel is possible, but not a very established practice. Besides requiring extensive resources, building and maintaining more than one attack platform and a large set of unique attack tools is prone to errors. After all, the folks building sophisticated malware are human software developers too. Getting back to the actual cost of an attack, being discovered is not only annoying for an attack group, it is also expensive and potentially dangerous. Exposure might get law enforcement on track or even inspire counter attacks.

Enough about the motivations though, APTs will keep being APTs; so let’s go on with exploring their malware and infrastructure.

Toolification

MISP is a system that is usually fed with threat indicators, which are then shared and internally correlated with other indicators already present in the database. The primary purpose of MISP is of course the distribution of threat indicators, aided by additional information gained through indicator correlation. As a by-product, though, MISP provides decent event documentation with a timeline.

That said, obviously we didn’t just walk in and query databases; the process of gathering data, cleaning up records, writing tools and forming theories that got again discarded was lengthy. In the end what is left is a neat sample set, a ton of data and a set of homegrown tools as well as extensions for already existing tools.

Please note though, this project is by no means an effort to perform large-scale clustering of malware with machine learning methods, nor does it involve any whatsoever sophisticated algorithms. The tools we designed are rather simple, the grouping of samples a mere sorting approach. Keep it simple, then scale it up, a data analysis wizard once told me.


Figure 2 – Tools and extensions created during data processing

MISP provides a web interface, which helps in manually investigating one or more events and looking for correlations, but it does not serve automation purposes. Therefore we established different ways to access and process the MISP dataset. We used a Python connector to shovel MISP data into Viper, a malware sample management framework. This way we were able to easily sort through events and select the ones which, based on predefined criteria, were highly likely involved with targeted attacks. These events were the base from where we started processing. The selection criteria and musings about attack nature will be outlined in a follow-up blogpost. To sketch the workflow, it went roughly the following way:

  1. Event data from Viper -> sample hashes -> sample processing / tools development with SQLite backend
  2. Event data from MISP -> pull to Redis backend -> event attribute processing
  3. Importing of sample processing data into Redis
  4. Data correlation and analysis with Redis
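The SQLite side of step 1 can be sketched as follows. The schema and column names are assumptions for illustration; the actual tooling is in the MISP repository mentioned later.

```python
# Minimal sketch of a SQLite backend for sample processing results:
# features go into one table, keyed by hash, so later correlation
# ("which samples share an imphash?") is a simple query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE samples (
    sha256   TEXT PRIMARY KEY,
    event_id TEXT,
    imphash  TEXT,
    filesize INTEGER)""")

rows = [
    ("aa11", "evt-1", "f34d5f2d4577ed6d9ceec516c1f5a744", 73728),
    ("bb22", "evt-2", "f34d5f2d4577ed6d9ceec516c1f5a744", 81920),
    ("cc33", "evt-3", "0c12802cd7a1b6a34b8f7962e32b36a1", 40960),
]
con.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)

# Correlate: imphashes that occur in more than one event
shared = con.execute("""
    SELECT imphash, COUNT(DISTINCT event_id) AS n
    FROM samples GROUP BY imphash HAVING n > 1""").fetchall()
print(shared)   # the shared imphash links evt-1 and evt-2
```

The same keying scheme maps naturally onto Redis (hash per sample, set per feature value) for the larger-scale correlation in steps 2-4.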

We faced a challenge when setting out to collect all the malware binaries involved with the events selected for further processing. MISP holds primarily threat indicators, including sample hashes, domains, e-mail addresses, file names and occasionally Yara signatures; rarely ever binaries. Also, malware, especially when involved in targeted attacks, is not always shared with the broader public. Eventually we managed to aggregate most of the fancied binaries with the help of public and private repositories.

The use of two different backend solutions has historical reasons, mainly we started off to work independently with our preferred solution, and in the end found there is no reason to abolish either. Redis is a strong, scalable backend, suited for medium to large scale data analysis; SQLite is practical in portability, small, elegant and effortless to set up.

For feature extraction from binaries we used the Python library pefile and instrumented IDA Pro with the help of IDAPython. Furthermore, we made use of the ssdeep fuzzy hashing library, exiftool for detailed file type analysis, and RapidMiner community edition for visualization.

Our developments were published in the course of the Troopers conference and are available on GitHub as part of the MISP repository.

The next blogpost of the Kings in your castle series will cover the nature of the analysis data set and include a discussion of the extracted feature set.

Covert Shotgun

Automatically finding SMT covert channels

Introduction

In my last blog post I found two covert channels in my Broadwell CPU. This blog post will again be about covert channels. For those unfamiliar, a covert channel is a side channel where the attacker has an implant in the victim context and uses his channel to “smuggle information” in and out of the victim context across existing security boundaries.

In this blog post I’ll explore how we can automate finding SMT covert channels. SMT, simultaneous multithreading, is the technical term for what Intel brands hyper threading.

Before I proceed I should note that one of the two covert channels I found in my last blog post, the one based on the RdSeed instruction, appears also to have been found by others. You can read about it in D. Evtyushkin & D. Ponomarev [1]. They will be presenting their work on this channel at CCS. Unlike myself, they develop the channel fully and discuss mitigations. So if you find this channel interesting, their paper is well worth a read.

 

SMT and side channels

For a long time I’ve wanted to look for side channels in SMT. Back with the Pentium 4, Intel figured that most subsystems in a CPU core are used relatively seldom, so they had the idea that you could run two hardware threads in the same core if you duplicate only the most important subsystems of the core. The remaining subsystems would be shared between the two hardware threads. In the event that the two threads simultaneously want to use a shared subsystem, one would just have to wait. In short, you gain a significant amount of performance for a little bit of infrastructure. The problem is, if you read what I just wrote sideways, it has “side channel” stamped all over it. In fact, there have been a number of examples of side channels related to SMT found in the past: for instance Acıiçmez and J.-P. Seifert [2], D’Antoine [3], and the neat little SMT side channel in the pause instruction from my last blog post – just to mention a few.

 

Covert Shotgun

The pause side channel I found by reading the Intel optimization manual. This is the kind of work one usually does to find these critters – sort of like a sniper looking for targets. But I figured it might be possible to automate the process – more like firing a shotgun and hoping you hit something. Hence my tool for doing so is called Covert Shotgun.

The basic idea is that you have three instructions: one whose latency you measure, and two which you use to send a signal. The instruction whose latency is measured I call the receiver instruction. Of the two signal instructions, one should use a subsystem that is not shared with the receiver instruction, whereas the other should. The receiver instruction’s latency will tend to be smaller when no subsystem is shared, and we can encode this smaller latency as a 0 bit. To automate the process, shotgun picks three instructions from a list of interesting instructions and measures the latency of the receiver instruction while running each of the two signal instructions in turn.

 

To implement this, I use the Keystone assembler engine to generate receiver and signal code. The signal code is optimized to maximize pressure on any subsystem used by the signal instruction:

A:
<signal_instruction>
<signal_instruction>
<signal_instruction>
<signal_instruction>
<signal_instruction>
jmp A

I repeat the instruction 5 times to minimize the effects of the jmp instruction. The receiver instruction’s latency is measured like this:

cpuid
mfence
rdtscp
mov rdi, rax
<receiver instruction>
rdtscp
sub rax, rdi
ret

With a list of instructions, I can pick a receiver instruction and two signal instructions and test the receiver with both signals. If there is a significant difference, I have a manifestation of a covert channel. Covert Shotgun uses plain brute force on the list, which makes it really slow and stupid (N² tests) – but for small lists it works well enough.

As for what makes a significant difference, my initial idea was to do a classic p-test on 1,000,000 timing samples to determine if there was statistical evidence of a channel. This turned up side channels that probably really do exist, but would be thoroughly impractical due to noise. The modified check is an average timing difference of at least 1 clock cycle or 1 standard deviation, whichever is larger, where I estimate the standard deviation as the square root of the sum of the sample variances of the two. This should ensure that the channel is practical. I acknowledge that the standard deviation estimator isn’t “BLUE”, but given the ad hoc nature of the “practicality measure” that is not an issue.
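The modified check translates into very little code. This is a sketch of the decision rule described above, with made-up latency samples; the function name is mine.

```python
# The "practicality" check: flag a channel only if the mean latency
# difference is at least max(1 cycle, 1 estimated standard deviation),
# where the deviation is estimated as sqrt(var(a) + var(b)).
from statistics import mean, variance
from math import sqrt

def is_practical_channel(times_sig0: list[float],
                         times_sig1: list[float]) -> bool:
    diff = abs(mean(times_sig1) - mean(times_sig0))
    est_std = sqrt(variance(times_sig0) + variance(times_sig1))
    return diff >= max(1.0, est_std)

quiet = [40, 41, 40, 42, 41, 40]     # receiver latency, no contention
noisy = [48, 50, 47, 49, 51, 48]     # latency with a contending signal
print(is_practical_channel(quiet, noisy))   # True: ~8 cycle gap, low variance
```

A statistically significant but sub-cycle difference fails this check, which is exactly the point: such channels exist but are too noisy to use.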

Now the smart reader will have noticed that I used the wording “manifestation of a covert channel”. I generally tend to only count something as a separate side channel if the subsystem where the contention is caused is unique. With this method we are likely to find several manifestations of the same side channel. However, it’s worth noting that different manifestations might have different properties, and consequently any manifestation could turn out to be useful. For example, flush+reload and flush+flush would under this definition be the same side channel, but flush+flush is faster and more stealthy than flush+reload. See my previous blog post on cache side channel attacks [4] and Gruss et al. [5] on flush+flush.

Also worth noting is that covert shotgun will only find side channels of the instruction timing type, and only when the instruction parameters are inconsequential. Finding a side channel like flush+reload or even flush+flush is not possible in this fashion.

Strictly speaking covert shotgun can find covert channels cross core. The reality is that cross-core covert channels of the instruction timing type are rare. I’ve only found the RdSeed channel so far.

The hardware

[Figure: simplified map of the instruction execution logic of a Broadwell core]

A priori I expect a significant amount of covert channels. To understand this expectation, let’s spend a second reviewing some hardware design. Below is a very simplified “map” of the instruction execution logic in my Broadwell Intel CPU. Specifically, I’ve tried to simplify by removing the memory subsystems from the diagram, as covert shotgun currently cannot be used to analyze them. Further, I’ve left the registers out for simplicity; they are not important for covert shotgun, to the extent that registers are duplicated for the two threads.

The first step in execution is parsing instructions into µops. There is a cache for µops which could probably be used for cache side channels – if I recall correctly I saw a paper on it. Covert shotgun currently only uses a handful of instructions at once and is thus unlikely to find side channels in this cache. The cache is however particularly interesting, as it would give information on which instructions are in the cache.

Once parsed, the µops are reordered to maximize parallel execution.

The reordered µops then enter a queue for the ports. Intel speak for this queue is “unified reservation station”, or sometimes just “reservation station”. The Intel architecture connects individual execution units through these ports.

Each port is capable of handling only one µop per cycle, meaning that it’s fairly easy to run into congestion in the ports. Each of the 8 ports is connected to up to 7 individual execution units, each of which executes particular specialized tasks. Some execution units are duplicated and available on multiple ports, some are available on only one port – this very well reflects that the CPU was designed for hyper-threading. This leads us to expect a high number of manifestations where the receiver instruction and the 1-bit sender instruction are identical – with reordering and multiple execution units making sure that it occurs less than 100% of the time. For 0-bit instructions we’d expect a speedup when running in parallel with themselves. The main point of congestion is probably in the individual execution units – especially those that are not duplicated.

Until Haswell, Intel had 6 ports; Haswell added an additional 2 ports, going to a total of 8. This allows further parallelization of execution. I think Skylake added even more ports and duplicated an integer execution unit for better parallelization of integer and vector instructions, but I’m not sure. More parallelization typically means fewer covert channels. These changes should serve as a reminder that the covert channels you find on, say, a Broadwell need not exist on a Sandy Bridge and vice versa, even when the instructions exist on both. The microarchitecture is subject to change, and these changes have significant effects on covert channels and other microarchitectural attacks.

Results

Due to the O(N²) nature of covert shotgun I ran a small list of 12 instructions, where some were picked because I suspected covert channels in relation to them, and some were chosen because I found them particularly likely to be boring. The list of instructions:

Instruction – My reasoning
RdSeed rax – Pretty slow, thus likely a “0” signal instruction
Pause – The most likely “0” signal instruction
Nop – Does it get any more benign?
Xor eax, eax – I used it in my last blog
Lea rax, [4*rax+edi+40960]; lea rdx, [8*eax+rdi+409623] – I suspected I might get into trouble with the address generation unit, by using SIB bytes, two instructions and a constant bigger than 2048.
RdRand rax – Sounded interesting
Add rax, 1
Bts rax, 1
Bt rax, 1 – Wanted to check near-identical instructions against each other
Cpuid – An interesting instruction on any day
Movq xmm1, xmm2 – Streaming instructions must be tested
Vtestps ymm1, ymm2 – More streaming instructions… this time 256 bit wide

 

Running these 12 instructions in brute force meant testing for 936 covert channels. Among these I found 521 practical covert channel manifestations. That’s a whopping 55%. Now, I expected there to be a lot of manifestations, but this magnitude surprised me. Personally I don’t think the detailed results are that interesting. The thing is, I think SMT in and of itself should be considered the side channel here, as opposed to considering congestion in the subsystems below the SMT as the core reason. However, I will give a summary of a few interesting finds below.

[Figure: matrix of measured covert channel manifestations]

RdSeed turned out to be an extraordinarily good “0” signal bit. All instructions run faster in parallel with this instruction than with other instructions. No less than 46 of the covert channel manifestations were with RdSeed. With RdSeed taking around 300 CLKs to finish, it is likely that this instruction to a very large extent does not interfere with the execution of other instructions, leaving every shared resource mostly free for maximum speed. The pause instruction has the same effect, but not quite as strong – and not quite as surprising as RdSeed, since part of the point of the pause instruction is to yield for the other hardware thread. Most likely the difference is that the pause instruction has significantly lower latency on my computer, and thus more overhead, than RdSeed. If I recall correctly, the pause instruction pauses significantly longer on Skylake. Interestingly enough, the RdSeed instruction also has a significant speedup effect on itself. In my estimation it’s a larger speedup than can be explained within the pipeline, and the most likely reason is the “seeding unit” being busy and the instruction thus terminating early.

Most of the 1-bit signal instructions do seem to have a higher latency when paired with themselves than with the average instruction in my test. With the limited number of instructions and the serious selection bias involved, I advise against drawing any conclusion.

Unfortunately, I messed up the “lea” test in that I had a dependency between the two lea’s, and consequently I needed to rerun the test. Rerunning the test with lea’s without dependencies did not appear to change the results significantly.

Future research

It would be interesting to run covert shotgun (preferably made a bit smarter) on much larger lists of instructions to get a much better overview of the situation. Another interesting project would be identifying the subsystems which are being congested by specific instructions. A good starting point would be Agner Fog’s instruction tables [6] and the associated software, which leverages performance counters to identify port usage and lots of other juicy instruction information.

Further, it would be interesting to investigate to what extent these covert channels extend to spying. In particular, I found the covariance in instruction timing between different instructions really interesting – there certainly are patterns in there. They suggest congestion on multiple subsystems, and perhaps – just perhaps – there is enough data in there to fingerprint code across SMT. That would be especially interesting in, say, SGX scenarios, or could lead directly to spying on branched victim code.

A word on microarchitecture timers

Most side channel attacks are timing attacks and thus subject to mitigation through modified timers, either by intention or in cases where a hypervisor for other reasons decided to take a VMExit on rdtsc instructions. For this reason, I’ve been searching for a fast timer that isn’t rdtsc, and so far I have not found one. A natural candidate would be a fast loop synchronized through either overt or covert access. Overt access through memory was in my experiments too slow to be useful; probably the reason is that a store operation is required, which triggers a cache coherency action. So I wanted to see if the RdSeed channel could be used. Unfortunately, the answer is a clear no, with RdSeed taking upwards of 200 CLKs while microarchitectural attacks often require accuracy below 70 CLKs. That said – I have this other idea which I hope I’ll get around to soon.

Conclusion

Despite my limited time with covert shotgun, it has shown great potential. Also, I think this simple experiment efficiently shows the dangers of SMT, if that wasn’t clear to everybody already. In my opinion it’s justified to talk about SMT being the core cause of all these side channels (and other SMT channels as well), and consequently it’s not very useful to talk about “a pause side channel” as a separate side channel, but rather as a manifestation of the SMT side channel. With the number of manifestations and the different subsystems available, the only reasonable mitigation left against these covert channels seems to be turning off SMT entirely in threat models that include attackers having an implant in your VM. Personally I think the root cause is that the attacker is in your VM in the first place. If these methods can be used for spying, it is very likely that some can be implemented in JavaScript and thus be used for remote attacks.

 

[1] Dmitry Evtyushkin & Dmitry Ponomarev, “Covert Channels through Random Number Generator: Mechanisms, Capacity Estimation and Mitigations”, http://www.cs.binghamton.edu/~dima/ccs16.pdf

[2] O. Acıiçmez and J.-P. Seifert, “Cheap Hardware Parallelism Implies Cheap Security”, Fault Diagnosis and Tolerance in Cryptography (FDTC’07), pages 80-91, IEEE Computer Society, 2007.

[3] Sophia M. D’Antoine, “Exploiting processor side channels to enable cross VM malicious execution”, http://www.sophia.re/SC/thesis.pdf

[4] Anders Fogh, “Cache side channel attacks”, http://dreamsofastone.blogspot.de/2015/09/cache-side-channel-attacks.html

 

[5] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf

[6] Agner Fog, “4. Instruction tables”, http://www.agner.org/optimize/instruction_tables.pdf

 

Two covert channels

Introduction

Too much WinWord. Too much TeX. Too many meetings. Too little CPU. It was time for a short pause from the grind and a dig into some tetravalent metalloid. My current project was too big a mouthful to get into before going to Black Hat, so I dug up a pet project to play around with. And then it happened – I needed some info from the Intel documentation, and before I knew it I had what I believe is a new side channel in my head. Then a second one. Then a third. Two of these side channels looked like they would make good arguments for a point I’ve been meaning to make. The third is more complex, and it might be fairly useful. Consequently, I decided to write this small blog post on the first two side channels and make my point. Both of these side channels are technically interesting; however, only the second is likely to be of any practical importance. I have not spent too much time or effort researching these channels, as they are just a side project of mine. Beware: I’ve not checked the literature for these two particular channels, so I might be repeating other people’s research. I do not recall having read anything though.

Not all side channels are equal

Not all side channels are created equal. Some are inherently more powerful than others. However, there is no good way to measure how good a side channel is – it often simply comes down to the subjective question of what an attacker needs the channel for.

The simplest use for a side channel is the so-called covert channel. Most microarchitectural side channels can be used to communicate between two trust domains when an attacker has access to both. An attacker can use covert channels for data exfiltration; the two most important reasons are to stay hidden or because regular channels are not open. An example of a covert channel would be an implant on a virtual machine in the cloud that communicates through, say, a cache side channel with another VM on the same physical computer. In this scenario the attacker doesn’t leave any TCP/IP logs and can communicate with his C2 server despite the defender turning off the network adapter.

While most microarchitectural side channels can be used as a covert channel, some side channels provide insight into normal system behavior and are thus powerful enough to be used for spying, even when the attacker does not have access to the victim’s trust domain.

In this blog post we’ll be looking at two side channels which fall into the category that is mostly irrelevant for spying, but useful for covert channels. Since these channels are probably limited in this way, I have refrained from researching them very deeply. The point of this blog isn’t the two new side channels, but a different one. However, both side channels are interesting from a technical viewpoint, and the second might even be practical.

 

Time for a pause – side channel with the pause instruction

Being old, one forgets the names of instructions from time to time; I needed a pause instruction and was searching for “wait”. Anyhow, I picked up the optimization guide because I knew the right name would be in there. And so it was, but then I read the description again and found that the pause instruction has a profound impact on hyper threads, because it does not significantly block resources of the other hardware thread on the computer. That has probably been utilized by somebody already. More interesting: if two hyper threads both use the pause instruction simultaneously, power can be saved. One has to wonder if that makes a difference in the execution time of the pause instruction. And it does. But the story is longer. The latency of the pause instruction is heavily dependent on power management settings and the current activity of the computer, so the channel depends on those settings. Further, Intel has been tinkering with the pause latency, so you’re likely to see different results on different computers. My results are from an i3-5005U; I did not test on other computers. My power settings were “Balanced”. Notice we are talking about a laptop here, which might be important too. Below is a plot of latency measurements of the pause instruction, with a core-co-located thread doing a loop of pause instructions versus an identically co-located thread doing a normal integer operation (xor eax,eax – I did test with other instructions, with similar results). The plot is for a loop of 40,000 measurements.

[Figure: pause instruction latency, co-located pause loop vs. co-located integer loop]

Usually we’d expect a speedup when a co-located hw-thread pauses, because the thread effectively yields all resources to the other hw-thread in the core, and in the measurements there is a bit of overhead that could be sped up. However, we are seeing the exact opposite. I can only speculate about the reason, but it seems likely that with both co-located hw-threads in pause mode, the entire core goes into pause mode. That is, the C-state of the core is increased, and this comes at a small latency cost. Whatever the cause is, we have a measurable side channel. There is not much to spy on with this channel; we could possibly spot when the kernel is doing a spinlock for some reason, but really juicy information does not seem to be a likely catch in this side channel. Since the channel depends heavily on power settings and on computer activity, it might not be very useful in practice. Further, the core-colocation requirement obviously detracts from the value of the side channel. Also, it’s a timing side channel using the rdtsc(p) instruction, which gives leverage for detection as well as mitigation through timing inaccuracy. In fact, this attack would be difficult to get to run under VirtualBox (Ubuntu host), as this virtual machine is set up to cause a vmexit on rdtsc, which makes the measurement of such small time differences difficult (though not impossible). Even worse for the value of this side channel, pause itself can be made to cause a vmexit and thus be used for direct detection/mitigation of any pause-based side channel attack. So while we have a side channel here, it’s not likely to be a very practical one.

Sowing a seed too many – side channel with rdseed

“Under heavy load, with multiple cores executing RDSEED in parallel, it is possible for the demand of random numbers by software processes/threads to exceed the rate at which the random number generator hardware can supply them. This will lead to the RDSEED instruction returning no data transitorily.”

 

That sounds like a very obvious side channel indeed. The RDSEED instruction was added with the Broadwell microarchitecture and uses heat entropy in the CPU to generate cryptographically strong random numbers. The problem with this instruction seems to be that a bit of time is needed between calls to RDSEED for enough entropy to build up. Thus Intel designed the instruction to return an “error” via the carry flag if insufficient time has passed since the last call of RDSEED. So the basic idea is that an attacker could create a covert channel using this instruction. To send a 1-bit, the sender implant loops an rdseed instruction; meanwhile the receiver runs a loop with plenty of time between rdseed calls. The information is extracted on the receiver’s end from a count of failed rdseed instructions. My simple test setup was an infinite sender loop which either called the rdseed instruction or not, depending on the bit I wanted to send. My receiver looped 1000 times around an rdseed instruction followed by a 10 ms Sleep() call. 0-bits caused zero failures in the receiver loop, and typically around 800 failures in the 1-bit scenario. I tested only on an i3-5005U Broadwell laptop, but with the sender and receiver threads pinned on the same core as well as on different cores. Results were near identical.
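The send/receive scheme can be illustrated with a toy simulation. Everything here – the entropy-pool model, its refill time, the loop counts – is my own invention to show the encoding, not measured CPU behavior:

```python
class ToyEntropyPool:
    """Crude stand-in for the CPU's seed source: rdseed fails (carry flag
    clear) unless enough 'time' has passed since the last successful read."""
    def __init__(self, refill_time=7):
        self.refill_time = refill_time
        self.last = -refill_time
        self.now = 0

    def tick(self, t=1):
        self.now += t

    def rdseed(self):
        if self.now - self.last >= self.refill_time:
            self.last = self.now
            return True     # success, fresh seed available
        return False        # pool exhausted

def send_and_receive(bit, pool, loops=1000, gap=50):
    """Sender hammers rdseed for a 1-bit and idles for a 0-bit; the
    receiver does one widely spaced rdseed per loop, counting failures."""
    failures = 0
    for _ in range(loops):
        if bit:
            for _ in range(gap):   # sender drains the pool constantly
                pool.rdseed()
                pool.tick()
        else:
            pool.tick(gap)         # sender idle, pool refills
        if not pool.rdseed():      # receiver's spaced probe
            failures += 1
    return failures

ones = send_and_receive(1, ToyEntropyPool())
zeros = send_and_receive(0, ToyEntropyPool())
print(ones, zeros)
```

Decoding is then a simple threshold on the failure count, mirroring the ~800-vs-0 split described above.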

 

This particular channel is interesting in so many ways, despite its limited use. It is a really rare thing for side channels in the CPU not to be timing attacks based around the rdtsc(p) instruction. In fact, I know of only one other attack: the instruction reordering attack of Sophia D’Antoine, which, if I understand it correctly, is limited to scenarios that are not really reflective of the real world. This attack however works cross core without other limiting factors – though it’s limited to microarchitectures that support the instruction, which is only the newest Intel CPUs. Further, it does not appear to cause any interrupt that would allow instrumentation, and does not appear to be wired to performance counters that would allow detection, making low-impact software mitigation difficult. Finally, the channel could theoretically be used to spy on benign usage of the rdseed instruction, but probably wouldn’t gain much information.

The real point

The point of this blog isn’t that I found two new side channels. The point here is an entirely different one. As a byproduct, I found at least two side channels in the processor, of which at least one does not seem fixable in software. It might be in microcode, but I don’t think that’s relevant either. The point I wish to make is that with the amount of side channels already known, and me finding two while not even looking for them, it seems likely that there are a great many side channels waiting to be discovered. Not to mention the likelihood that Intel will add new ones with time (recall that the second side channel arrived with the Broadwell microarchitecture). Even if we knew all the side channels, we’d be hard pressed to come up with software detection/mitigation, and it seems unlikely that Intel will fix these side channels, as they’ve not done so in the past. Consequently, we should probably accept that covert channels in cloud computing are a fact of life and consider them an unavoidable risk. This of course does not mean we shouldn’t continue to find and document the channels, since our only defense against them is if malware analysts are able to spot one in a malicious implant.

 

 

BlackHoodie #2 – We roll again :)

Last year I held a free reverse engineering workshop for women, mainly in the not entirely unselfish interest of seeing more of them around in the security field. More about the motivations, the whys, the obstacles and how it turned out you can read up here, here and here. Looking back, I’m super happy with this little project and, leaning out the window a bit further, would call it a big success.

That said, I gleefully announce BlackHoodie #2, the next women-only reversing workshop, taking place in Bochum, Germany on the weekend of 19th + 20th of November 2016. This edition will be held in cooperation with Katja Hahn, a splendid binary analyst herself, and Priya Chalakkal, an up-and-coming hacker of all things; and it will follow the same principles as last year. It will be free of charge, no strings attached, and aim to help femgineers enter a field that’s not easily accessible.

Moreover, in a wonderful initiative, community members have announced their support of this year’s edition by covering the travel expenses of a BlackHoodie attendee. The US startup Iperlane and Thomas Dullien aka Halvar Flake will cover the trip for one lady who decides to join the workshop. The lucky attendee will be randomly selected from the group of registered participants.

May there be oh so many participants 🙂 🙂 So here we go again..

Why women only?

Because a girl-to-girl conversation is so much more fruitful than a full classroom with only one or two women hiding in the corners. I’ve done so many things in my life where I was the *only* girl among X other participants, and I promise I’ve been hiding in the corners more than once.

For the gents it might not be that obvious, but it is not easy for young females who haven’t yet found their place in life to walk into a classroom, a university lecture, an office or a conference room full of men. Who, generally speaking, very often seem to know their place very well.

I’ve had girls in my classes before, hiding and holding back although I am so certain they would have been capable of so much more than what their final results showed. So yeah, this will be women only, for every female should feel welcome and encouraged to do her best and get the most out of it.

Why more women in low-level technical jobs in general?

  • It’s difficult. Mastering something difficult makes you happy. I want all of you to be happy.
  • It pays well. While money also makes you happy, what’s more important, it gives you courage and independence.
  • It keeps you busy. Lots of open job positions globally, even better, believe it or not it is addictive and you might even find yourself a new hobby.

Hardfacts

  • It’s gonna be Katja, and Priya, and me, and a binary, and you, and plenty of debuggers
  • Online preparation assignments, 4 of them, over the course of two months prior to the workshop
  • Workshop 19./20. of November at G DATA Academy, Bochum Germany
  • No fees, no strings attached, all you have to do is get there
  • Please register with your name or nickname and a short note about your background at blackhoodie at 0x1338 .at

Prerequisites

  • Being female
  • Computer science background in a sense you understand programming logic, how a processor works and how an operating system works
  • A Notebook capable of running at least one virtual machine
  • A virtual machine, preferably WinXP 32-bit
  • Guts 🙂 (It is going to be a lot to learn in a very short time)

REGISTRATION:

Please register with your name or nickname and a short note about your background at blackhoodie at 0x1338 .at. About two weeks before the event you will be asked for a final confirmation of your participation.

Row hammer the short summary

 

Introduction

This is the first updated version of my original “Row hammer the short summary” blog post. As I had predicted, the summer was going to be interesting in terms of row hammer, and it certainly appears I was right about that. With so much going on, I found it worthwhile to update this blog post to be in line with the latest developments and to fix a few minor details.

 

Short version of how DRAM works

Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you’re buying a DIMM. If you look at a DIMM, most DIMMs will have chips on both sides. Each side of the DIMM is a rank. Each rank again consists of a number of banks, which live in the physical individual chips you can see. Inside a bank you’d find a two-dimensional matrix of memory cells. There are 32k rows in the matrix and 64k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify a bit equal to 1, and no charge when the bit is equal to 0 (on some chips the encoding is reversed). Thus a row stores 8KB or 64KB of data, depending on the exact kind of DRAM you have in front of you. When you read or write from/to DRAM, an entire row is first read into a so-called row buffer. This is because reading automatically discharges the capacitors, and because writes rarely rewrite the entire row. Reading a row into the row buffer is called activating the row. An active row is thus cached in the row buffer; if a row is already active, it is not reactivated on requests. Also, to prevent the capacitors losing their charge over time, they are refreshed regularly (typically every 64 ms) by activating the rows.
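As a quick sanity check on the geometry: with one bit per cell, rows of 8KB and 64KB correspond to 64k and 512k cells:

```python
KiB = 1024

# One cell stores one bit, so a row's cell count is 8x its byte count.
for row_bytes in (8 * KiB, 64 * KiB):
    cells = row_bytes * 8
    print(f"{row_bytes // KiB} KB row -> {cells // KiB}k cells")
```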

 

Row hammer introduction

This section is based on [1] Kim et al. where not otherwise noted.

When a row is activated, a small effect is caused on the neighboring rows due to so-called cross talk effects. The effect can be caused by electromagnetic interference, by so-called conductive bridges where there is minor electric conductivity in DRAM modules where there shouldn’t be, and finally so-called hot-carrier injection may play a role, where an electron reaches sufficient kinetic energy that it leaks from the system or even permanently damages parts of the circuitry. The net effect is a loss of charge in the DRAM cell, which if large enough will cause a bit to flip.

Consequently, it’s possible to cause bits to flip in DRAM: by repeatedly and systematically reading or writing from/to two rows in DRAM (to activate the rows), bit flips can be introduced in rows up to 9 rows away from these two “aggressor rows”. These 9 rows are called victim rows. The most errors happen in the rows immediately next to an aggressor row. Picking the aggressor rows so they bracket a victim row is called double-sided row hammering and is far more efficient than normal row hammering. Using two adjacent rows to hammer surrounding rows is called amplified single-sided hammering and can be useful in exploitation scenarios. If the victim rows are refreshed before enough cross talk can be generated, no bit flips are incurred. As a rule of thumb, the higher the frequency of row activation, the higher the probability of flipping bits.

It has been shown that bits can be flipped in less than 9 milliseconds, typically requiring around 128k row activations. [3] Seaborn & Dullien have reported bit flips with as little as 98k row activations.

The problem occurs primarily with RAM produced after 2010. In a sample of 129 RAM modules from 3 manufacturers, over 80% were vulnerable, with all modules produced after 2012 being vulnerable. [4] Lanteigne showed that DDR4 RAM is vulnerable too, with 8 out of 12 sampled DRAMs being subject to bit flips. Further, this paper showed that certain patterns in the DRAM rows were more likely to cause bit flips.

[21] Lanteigne concludes that AMD and Intel CPUs are both capable of row hammering, but that the most bit flips are encountered when the methodology is adapted to the underlying memory controller in the attacked system.

 

Physical addressing and finding banks and rows

Obviously, to cause row hammering one needs two addresses belonging to rows in the same bank. [2] showed that repeatedly choosing two random addresses in a large buffer would, in a practical amount of time, yield a pair in the same bank and thus be useful for hammering in software.
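A back-of-the-envelope sketch of why random pairing works; the bank count below is an assumed example, and bank selection is modeled as uniformly random, which is a simplification of real address mappings:

```python
import random

def tries_until_same_bank(num_banks, rng):
    """Count how many random address pairs are drawn before both
    addresses of a pair map to the same bank."""
    tries = 0
    while True:
        tries += 1
        if rng.randrange(num_banks) == rng.randrange(num_banks):
            return tries

rng = random.Random(0)  # fixed seed for reproducibility
trials = [tries_until_same_bank(32, rng) for _ in range(1000)]
print(sum(trials) / len(trials))  # on average about num_banks tries
```

With, say, 32 banks an attacker expects a usable pair after roughly 32 random draws, which is why the brute-force approach is practical.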

An optimal solution requires that the attacker has knowledge of physical addresses. Even with a physical address, an attacker would need to know how it maps to DIMM, bank and row to hammer optimally. [5] Seaborn used row hammer itself to derive the complex function that maps physical addresses to DRAM locations for a Sandy Bridge computer. [6] Pessl et al. showed how to use “row buffer side channel attacks” to derive the complex addressing function generically, and provided maps for many modern Intel CPUs.

To obtain physical addresses, /proc/$PID/pagemap could provide the information. However, /proc/$PID/pagemap is not available in all operating systems, and no longer affords unprivileged access in most operating systems that do support it. For an attacker, this problem remains to be definitively solved.
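Where pagemap is available and the process is privileged, the lookup itself is simple. A Linux-only sketch (the bit layout follows the kernel’s pagemap documentation; unprivileged processes see a PFN of 0 on modern kernels, so the function returns None):

```python
import ctypes
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def virt_to_phys(vaddr):
    """Translate a virtual address via /proc/self/pagemap (Linux only).
    Returns None if the page is not present or the PFN is hidden."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)       # one 64-bit entry per page
        entry, = struct.unpack("<Q", f.read(8))
    if not entry >> 63 & 1:                    # bit 63: page present
        return None
    pfn = entry & ((1 << 55) - 1)              # bits 0-54: page frame number
    return pfn * PAGE_SIZE + vaddr % PAGE_SIZE if pfn else None

buf = ctypes.create_string_buffer(b"x" * PAGE_SIZE)  # touch a page
print(virt_to_phys(ctypes.addressof(buf)))
```

Run as root this prints a physical address; unprivileged it prints None, which is exactly the access restriction described above.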

 

Row hammering with Software

To cause row hammer from software you need to activate memory rows, that is, cause reads or writes to physical memory. However, modern processors are equipped with caches so that they don’t incur serious speed penalties when memory is read or written. Thus, to cause row hammering bit flips, it’s required to bypass the caches.

[1] did this using the clflush instruction, which removes a particular address from the cache, causing the next read of the address to go directly to memory. This approach has two downsides. First, since clflush is a rare instruction, validator sandboxes (such as NaCl of Google Chrome) can ban this instruction and thus defeat the attack. Second, just-in-time compilers and existing code on computers generally do not use this opcode, disabling attack scenarios where JIT compilers are used (such as JavaScript), or, in the future, using existing code in data-only attacks.

[7] Aweke posted on a forum that he was able to flip bits without clflush – he did not say how, but it was likely using the same method as [8], which systematically accessed memory in a pattern that causes the processor to evict the address of the aggressor row from the cache, causing the next read to go to physical memory. Unfortunately, how to evict caches is CPU dependent and undocumented, and despite [8] Gruss, Maurice & Mangard mapping out how to optimally evict on most modern CPUs, it’s not the most trivial process. Typically, this requires knowledge of the physical address discussed above as well as a complex mapping function for cache sets. It is however possible to approximate this either through using large pages or through timing side channels. Also, it is slower and thus less efficient than the clflush version above. Since this vector does not require special instructions, it’s applicable to native code (including sandboxed code), JavaScript and potentially other JIT compiled languages.
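The eviction idea can be illustrated with a toy model of a single cache set. Real eviction policies are, as noted, undocumented and generally not plain LRU, so the class below is only a sketch with made-up names:

```python
from collections import OrderedDict

class ToyCacheSet:
    """One LRU cache set with a fixed number of ways."""
    def __init__(self, ways=8):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, tag):
        """Touch a line; returns True on a cache hit, False on a miss."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)          # refresh LRU position
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)   # evict the LRU line
            self.lines[tag] = True
        return hit

cache_set = ToyCacheSet(ways=8)
cache_set.access("aggressor")          # the aggressor row's line is cached
for tag in range(8):                   # access 8 other congruent addresses
    cache_set.access(tag)
print(cache_set.access("aggressor"))   # miss: the next real access would
                                       # go to DRAM and activate the row
```

Accessing `ways` addresses that map to the same set pushes the aggressor’s line out, which is exactly why an eviction set can replace clflush, albeit more slowly.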

[9] Qiao & Seaborn found that the movnti instruction is capable of bypassing the caches on its own. Further, this instruction is commonly used – including in memcpy/memset implementations in common libraries – and is thus difficult to ban in validator sandboxes, which lowers the burden for future row hammer as a code reuse attack. It remains to be seen if JIT compiled languages can make use of it.

Finally, [10] Fogh showed that the policies that maintain the coherency of multiple caches on the CPU can be used to cause row activations, and speculated that it would be fast enough for row hammering. Since the coherency policies are subject to much less change than cache eviction policies and do not require any special instructions, this method may solve the problems with the other methods, should it be fast enough. This remains to be researched.

 

Exploiting row hammer

[2] showed that row hammer could be used to break out of the NaCl Chrome sandbox. The NaCl sandbox protects itself by verifying all code paths before execution and blocking undesired activities. To prevent unintended code paths from being executed, the sandbox enforces 32-byte-aligned jump targets by masking the register and adding a base address. In code it looks like this:

and rax, ~31

add rax, r15  //(base address of sandbox)

jmp rax

Bit flips in these instructions often cause other registers to be used, bypassing the limits the sandbox enforces on jumps. The attack sprays code like the above, row hammers, checks for usable bit flips, and finally uses one of these jumps to execute a non-validated code path and thus exit the sandbox. Non-validated code paths can be entered through code-reuse-style gadgets.

The second and more serious privilege escalation attack demonstrated by [2] went from ring 3 user privileges to ring 0. Since adjacent physical addresses have a tendency to be used at the same time, CPU vendors map adjacent physical addresses to different parts of RAM, as this lets DRAM handle memory requests in parallel. This has the effect that banks are often shared between different software across trust boundaries, which allows an attacker to flip bits in page table entries (PTEs). PTEs are central to security on x86: they control access rights to memory as well as the mapping between virtual and physical addresses. By repeatedly memory-mapping a file, many new PTEs are sprayed, and the attack has a good chance of hitting one at random with row hammer. The attacker hopes for a bit flip such that a PTE with write privileges comes to map a physical address where another PTE is located. By strategically modifying this second PTE, the attacker gains read & write access to the entire memory.
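The PTE-spraying step can be sketched as below. This only shows the mmap loop that creates many page-table entries, not the hammering or the search for a corrupted mapping; NMAPS is kept tiny for illustration, while the real attack uses many thousands of mappings:

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the PTE-spraying step from [2]: mapping the same file over
 * and over fills physical memory with page-table pages, so a random bit
 * flip in a writable PTE has a good chance of landing on another page
 * table. NMAPS is kept tiny here for illustration. */
#define NMAPS 64

int spray_ptes(const char *path, void *maps[NMAPS])
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    int n = 0;
    for (; n < NMAPS; n++) {
        maps[n] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);   /* one new PTE per mapping */
        if (maps[n] == MAP_FAILED)
            break;
    }
    close(fd);
    return n;   /* number of mappings created */
}
```

Because all mappings are MAP_SHARED views of the same file page, a write through one is visible through every other – it is only the page-table entries, the actual hammering targets, that multiply.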

[18] Bhattacharya & Mukhopadhyay used row hammer to extract a private 1024-bit RSA key. Their attack used a combination of PAGEMAP, a cache side channel attack (Prime+Probe) and a row buffer side channel attack to find the approximate location of the private key in physical memory. Once it is located, row hammer is used to introduce a fault in the private key, and fault analysis is then used to derive the real private key. This makes the attack somewhat unique, in that it is the only attack so far that does not rely on any kind of spraying.

[19] Bosman et al. break Microsoft Edge's JavaScript sandbox. First they use two novel deduplication attacks to gain knowledge about the address space layout. This allows them to create a valid-looking, but counterfeit, JavaScript object inside a double array. They then find bit flips by allocating a large array and using single-sided row hammering. The method used is similar to [8], but they also notice and exploit the fact that pages 128K apart are likely to be cache set congruent. Once they know where the bit flips are, they can place a valid object at that address, crafted so that the bit flip will turn it into a reference to the counterfeit object. Once this is set up, the row hammering is repeated, and they now have a reference to the counterfeit object that can be used by compiled JavaScript. Further, the object can be edited through the double array in which it was created, and this allows arbitrary memory read and write.

[20] Xiao et al.: the content of this paper is unknown to me, yet the title suggests that cross-VM attacks and a new kind of privilege escalation are possible with row hammer.

It is likely that row hammer can be exploited in other ways too.

 

Row hammer mitigation

Most suggested hardware mitigations focus on refreshing victim rows, thus leaving less time for row hammer to do its work. Unfortunately, while refreshing, RAM is unable to respond to requests from the CPU, so this comes at a performance penalty.

The simplest suggestion is to increase the refresh rate for all RAM. Much hardware supports this already for high temperatures. Usually the refresh rate is doubled; however, to perfectly rule out row hammer one would need to increase the refresh rate more than 7-fold [1], which in turn is a steep performance penalty and a power consumption issue.

TRR [17] is a method that keeps track of used rows and triggers targeted refreshes of their neighbors to minimize the penalty. The method needs to be supported by both the CPU and the RAM modules. MAC, also known as maximum activation count, keeps track of how many times a given row was activated. pTRR does this only statistically and is thus cheaper to build into hardware. PARA [1] is another suggested hardware mitigation that statistically refreshes victim rows. ARMOR [16] is a solution that keeps track of row activations in the memory interface.

It has been suggested that ECC RAM can be used as a mitigation. Unfortunately, ECC RAM will not detect or correct bit flips in all instances where there are multiple bit flips in a row, which leaves room for an attacker to be successful even with ECC RAM. Also, ECC RAM may cause the attacked system to reset, turning row hammer into a denial-of-service attack. [4] suggests this problem persists in real-life experiments. Keeping track of ECC errors may, however, serve as an indication that a system was under attack and could be used to trigger other countermeasures.

Nishat Herath and I suggested using the LLC miss performance counter to detect row hammering in [11] Fogh & Herath. LLC misses are rare in real usage, but abundant in row hammering scenarios. [12] Gruss et al. and [13] Payer refined the method by correcting for general activity in the memory subsystem. These methods are likely to produce false positives in some cases, and [11] and [13] therefore suggested only slowing down offenders to prevent bit flips. [14] Aweke et al. presented a method that first detects row hammering as above, then verifies it using PEBS performance monitoring, which has the advantage of delivering an address related to a cache miss and thus grants the ability to selectively read neighbor rows – in effect, targeted row refresh in software. [15] Fogh speculated that this method could be effectively bypassed by an attacker.
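As a hedged sketch of the counter-based idea on Linux, one can count LLC misses around a window of code with perf_event_open. The event choice and any detection threshold are machine-dependent; here the "window" is a built-in flush+read loop so the example is self-contained, and the function simply returns -1 where perf is unavailable (permissions, VMs without a PMU):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <emmintrin.h>

/* Sketch of the LLC-miss heuristic from [11]: count last-level cache
 * misses over a window of activity. Misses are rare in benign code but
 * abundant while a cache attack or row hammering runs. */
static long long llc_misses_during_flush(volatile uint8_t *buf,
                                         size_t size, int iters)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* ~ LLC misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;                   /* perf unavailable here */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < size; i += 64) {
            _mm_clflush((void *)&buf[i]);  /* force the next read to miss */
            (void)buf[i];
        }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = -1;
    close(fd);
    return count;
}
```

A detector would periodically read such a counter (or take an interrupt every N misses) and flag processes whose miss rate is implausibly high for benign code.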

Literature

[1] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361–372, June 2014.

[2] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015.

[3] Mark Seaborn and Thomas Dullien. “Exploiting the DRAM rowhammer bug to gain kernel privileges”. https://www.blackhat.com/docs/us-15/materials/us-15-Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf

[4] Mark Lanteigne: “How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware”. Third I/O. http://www.thirdio.com/rowhammer.pdf

[5] Mark Seaborn: “How physical addresses map to rows and banks in DRAM”

[6] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, Stefan Mangard: “Reverse Engineering Intel DRAM Addressing and Exploitation”

[7] Zelalem Birhanu Aweke: “Rowhammer without CLFLUSH”. https://groups.google.com/forum/#!topic/rowhammer-discuss/ojgTgLr4qM, May 2015, retrieved on July 16, 2015.

[8] Daniel Gruss, Clémentine Maurice, Stefan Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”

[9] Rui Qiao, Mark Seaborn: “A New Approach for Rowhammer Attacks”. http://seclab.cs.sunysb.edu/seclab/pubs/host16.pdf

[10] Anders Fogh: “Row hammer, java script and MESI”. http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html

[11] Anders Fogh, Nishat Herath. “These Are Not Your Grand Daddys CPU Performance Counters”. Black Hat 2015. http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html

[12] Daniel Gruss, Clémentine Maurice, Klaus Wagner, Stefan Mangard: “Flush+Flush: A Fast and Stealthy Cache Attack”

[13] Mathias Payer: “HexPADS: a platform to detect “stealth” attacks“. https://nebelwelt.net/publications/files/16ESSoS.pdf

[14] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin:”ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”

[15] Anders Fogh: “Anvil & Next generation row hammer attacks”. http://dreamsofastone.blogspot.de/2016/03/anvil-next-generation-row-hammer-attacks.html

[16] http://apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer/armor.html

[17] http://www.jedec.org/standards-documents/results/jesd209-4

[18] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf

[19] Erik Bosman, Kaveh Razavi, Herbert Bos, Cristiano Giuffrida “Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector”

[20] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Mircea-Radu Teodorescu:”One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation”.  To be published

[21] Mark Lanteigne: “A Tale of Two Hammers. A Brief Rowhammer Analysis of AMD vs. Intel”. http://www.thirdio.com/rowhammera1.pdf

Cache side channel attacks: CPU Design as a security problem

 

At the end of May I had the opportunity to present my research on cache side channel attacks at the “Hack In The Box” conference. After my presentation with Nishat Herath last year at Black Hat I published my private comments on that slide deck, and that was well received. I had decided to do the same for “Hack In The Box”; unfortunately, it took me a little longer to translate my comments into something human-readable. But here they are. Since the comments relate directly to specific slides in the deck, you will probably want to have the slide deck open while reading this blog post. You can find it here: https://conference.hitb.org/hitbsecconf2016ams/materials/D2T1%20-%20Anders%20Fogh%20-%20Cache%20Side%20Channel%20Attacks.pdf

Cache side channel attacks: CPU Design as a security problem

The title of the presentation took quite a while to figure out, because I wanted a title that fairly and accurately described the content, but one that was also slightly catchy. I like what I came up with.

Here I told, as an introduction, how I got into cache side channel attacks. I was working on a talk about row hammer detection when Daniel Gruss tweeted that he had been able to row hammer in JavaScript. Consequently, I had to figure out how he did it, so that I could speak with confidence. Everything pointed towards the cache side channel attack literature, so I dug in and four hours later produced my first cache side channel attack. This side channel attack triggered my row hammer detection, and that led me into a year's worth of playing with CPU caches.

 

Agenda

When I start doing slides, I always think about what I want to come across to the listener, knowing well that most people will have forgotten most of it two hours later.

1)      I wanted to point out that safe software running on unsafe hardware is unsafe – something I think is too often forgotten. Further, I wanted to make the point that we are not talking about a “bug” in the sense that an Intel engineer made a mistake. Rather, the cache system generally works as designed – though this design has unintended consequences: cache side channel attacks. Specifically:

  1. The attacker can manipulate the cache: shared L3 (between cores & trust zones) + inclusive cache hierarchy.
  2. The cache state can be queried and gives back a lot of valuable information: timing attacks (on access & clflush), shared L3, N-way set associative cache, and a complex addressing function for L3 that maps application addresses to cache sets.

2)      Here I just wanted to point out that this isn’t just an academic issue. In many ways cache side channel attacks can be a real-world alternative to VM breakout and privilege escalation attacks – or even a method to support either of those.

3)      I wanted to make the point that we can do something about these critters – even when the attacker adapts.

Introduction

  • The most important thing I wanted to get across was that CSCs are an attack methodology and not a single attack. Further, the current cache design is not a requirement for being an x86: the cache could be designed differently and be less useful for cache side channel attacks without causing software incompatibility.
  • The bandwidth of the side channel is essentially the bandwidth of the cache, so a lot of valuable data can be leaked. The size of the caches makes the retention time of the information high and thus allows us to actually read out the data in most cases (compare this to D’Antoine’s attack on instruction reordering, where the information is lost immediately if not queried). Finally, software cannot really avoid using the cache, and thus leaks information.
  • Here I wanted to point out that Intel is unlikely to provide a fix. They won’t replace old processors, a microcode update is unlikely, and Intel has been shipping CPUs for a while without fixing the issue. This leads us towards software fixes for a hardware problem – a situation which is always suboptimal, but short to medium term the only option. It was my intention to show that, while suboptimal, this is entirely within the realm of the possible.

Scope

Literature: ARM: [11], TLB [9], Row buffer [10]

L3 is not small

The point here was to give a sense of scale of the importance of the L3 cache in the CPU. Also, it’s kind of neat that you can actually see the slices of the L3 cache in a die shot.

Why is this interesting

The point of this slide was to illustrate the flexibility and power of CSCs. Important to notice here: CSCs are always attacks on software implementations – not on algorithms. One RSA implementation might leak its private key; another implementation may not.

Literature: Covert channels [13], RSA[4], AES[6], ECDSA[5],Keyboard[7],Mouse[8], KASRL[9]

How the data cache works on Intel + Important cache features

Here I wanted to introduce the design features that allow the attacker to manipulate the cache hierarchy. The second slide is just a summary slide.

 

How memory is stored in L3 + N-Way Set Associative L3

Here I wanted to introduce how the cache stores data. This is further background for manipulation, but also serves as background for getting information out of the cache. Essentially, the cache line concept limits the accuracy of the attacks, while the N-way set associative design (as opposed to a fully associative cache) allows an attacker to work on much smaller units, making attacks practical. Further, it allows an attacker to deduce information from cache-set congruence, something that is supported by the complex addressing function for the cache. I deliberately avoided the complex addressing function to limit complexity, and just mentioned that “the colors are not clumped together, but spread out”. Cache side channel attacks have historically only approximated this complex addressing function by doing a simple timing attack – loading N+1 addresses from the same set will cause one not to be cached, and thus accessing it will be slow. The complex addressing function for cache sets has recently been reverse engineered for most Intel CPUs.
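For a simple N-way set-associative cache without the complex addressing function, the way an address selects a set can be sketched as plain arithmetic (the slice hash on real Intel L3s replaces the modulo, but the per-slice structure is the same):

```c
#include <stdint.h>

/* How an address picks a cache set in a simple N-way set-associative
 * cache: the low 6 bits select the byte within the 64-byte line, the
 * next bits select the set. n_sets is a power of two. */
#define LINE_SIZE 64

static unsigned cache_set(uintptr_t addr, unsigned n_sets)
{
    return (unsigned)((addr / LINE_SIZE) % n_sets);
}
```

With, say, 1024 sets of 64-byte lines, addresses 64 KiB apart collide in the same set – which is why set-congruent eviction addresses can be found by stepping through memory at the right stride.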

 

Example code to attack

The example code I use to explain the attacks is inspired by a function in the GDK library, which is significantly more complex than this. I wanted to underline that code as well as data can be attacked, and that branches matter. I wanted to speak about the big 3 attacks in the following slides by highlighting some differences in a concrete example. The GDK library was first attacked by [7].

Common for all CSC

Here the goal was to introduce the timing attack and the cache manipulation primitives of the big 3 attacks. Evict is just accessing enough cache lines congruent to a cache set to cause all previously stored information to be evicted. Prime is a variation on Evict, where you carefully access cache lines congruent to a cache set in a way that lets the attacker know what is in the cache set. The Flush primitive is simply the clflush instruction, which removes a specific cache line from the cache hierarchy.
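The timing attack common to all of them can be sketched as a single timed load. The mfence/rdtscp bracketing keeps the CPU from reordering the load outside the measurement; a cached load takes tens of cycles, a DRAM load hundreds, with the cutoff between the two being machine-specific:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_mfence, _mm_clflush */

/* Measure one memory access in cycles. A low result means the line was
 * cached; a high result means it came from DRAM. The mfences prevent
 * the CPU from moving the load outside the timestamp bracket. */
static uint64_t time_access(volatile uint8_t *p)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                        /* the access being timed */
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return t1 - t0;
}
```

Every one of the big 3 attacks is some combination of this timer with one of the Evict/Prime/Flush manipulation primitives.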

 

Big 3 Cache side channel Attacks

I wanted to comment that the 3 attacks and their variants make up most CSCs, and that they are all fairly similar in structure. Also, I wanted to point back to the primitives described in the previous slide.

Literature: Evict+Time[1], Prime+Probe[2], Flush+Reload[3]

“The 4 slides on the attacks”

In these slides I wanted to highlight some advantages and disadvantages of each attack. Evict+Time does not give any temporal resolution – which I call “post mortem analysis” – but you don’t need to worry about synchronizing with the victim. Synchronizing can be done by any means, including a separate CSC, a function call or any other information leak – or even by spying constantly. Though the accuracy of both is cache-set congruence, it should be noted that Prime+Probe is disturbed by any cache-set-congruent memory operation, whereas Evict+Time is only disturbed by operations that evict exactly those cache lines used by the function. However, calling a function tends to bring more non-relevant code into play, and thus noise. Noise is a more complex topic than the impression the overview slide gives.

Shared memory

I heard the “nobody shares memory” comment one too many times and wanted to raise a flag that it isn’t that uncommon. I also wanted to warn against shared memory – particularly as introduced through deduplication, since that’s the most common vector in VM environments. Flush+Reload is just one example of the problems with deduplication; one can pile on with D’Antoine’s instruction reordering attack, deduplication attacks, row buffer side channel attacks and, more speculatively, row hammer becoming a practical cross-VM vector with deduplication, etc.

 

Noise

I wanted to point out that CSCs are noisy. Usually the noise is due to contention with irrelevant code running on multi-core CPUs, or contention in subsystems other than the cache. Also, the hardware prefetcher can ruin things for an attacker. Of these effects, only the effect of the hardware prefetcher cannot be eliminated by repeating the attack – though obviously not all attacks lend themselves to being repeated (you cannot ask a user for his password 10k times). Sometimes you get good results on the first attempt; I had an Evict+Time attack that required more than 10k attempts to give satisfying results.

 

Performance counters

My “agenda” in my Black Hat talk last year was to communicate that performance counters can be an important tool for security-related weirdness in the CPU. Part of this slide is an attempt to repeat that message. The important part was to introduce performance counters, the performance counter interrupt and setting up the counters, as they form an important part of my work on detecting cache side channel attacks.

Flush + Reload code

Here the clflush is used as a “manipulate the cache to a known state” primitive.

The mov instruction (which could be replaced by most memory access instructions) is the reload phase. RDTSC(P) does the actual timing of the reload phase, and the mfence instructions prevent the CPU from reordering the reload phase outside the rdtsc(p) bracket.

My comments read that I should explain that cache side channel attacks like Flush+Reload are in a race condition with the process being spied upon. Say we’re attacking keyboard input: we’ll be less visible if we wait a few milliseconds per iteration, because nobody types that fast; whereas for crypto we almost always need much higher temporal resolution and usually wouldn’t wait at all.
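Putting the pieces of this slide together, one Flush+Reload iteration on a shared address might look like the sketch below. THRESHOLD is an assumed cached/uncached cutoff in cycles that has to be calibrated per machine (e.g. with timed hot and cold accesses):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_mfence, _mm_clflush */

/* One flush+reload iteration: time the reload, then flush back to a
 * known state for the next round. Returns 1 if the victim touched the
 * shared line since the last flush. THRESHOLD is an assumed,
 * machine-specific cycle cutoff. */
#define THRESHOLD 150   /* assumed; calibrate per CPU */

static int probe(volatile uint8_t *shared)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared;                    /* reload phase */
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    _mm_clflush((void *)shared);      /* back to the known state */
    return (t1 - t0) < THRESHOLD;     /* fast => cached => victim access */
}
```

The per-iteration wait between probes is where the race condition above plays out: longer waits are stealthier but lower the temporal resolution.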

Detecting flush+reload

My original suggestion was to measure how long a given count of cache misses takes. If too fast, we have a cache side channel attack. [12] and [13] improved on that. All of these work fairly well.

Row hammer was flush + reload

Just noting here that if we remove the information acquisition and do it for two addresses at once, we have the original row hammer code. It’s then easy to see that row hammer was a Flush+Reload attack. The word “was” was carefully chosen: others have since shown that the movnti instruction is a vector for row hammer too, and that vector is not a Flush+Reload-style attack. To round off my introduction, I hope I mentioned that rowhammer.js was a Flush+Reload variation that I (and others) usually call Evict+Reload, using the eviction primitive I discussed in a previous slide.

 

Flush + Flush

The back story here is that I’d figured out that clflush would leak information about the cache state, so when Daniel Gruss and Clémentine Maurice approached me about detecting a cache attack that causes fewer cache misses, I immediately knew what they were talking about. Instead of competing to publish, I did not do more work in this direction. I did get to review their wonderful paper, though.

Flush+Flush works by replacing the mov instruction in Flush+Reload with a clflush, but is otherwise identical. The clflush instruction is probably faster when the cache line parameter isn’t in the cache because it can shortcut actually flushing the cache.

Flush+Flush has an advantage beyond stealthiness: clflush is faster than a mov on uncached memory. It also leaves the cache in a known state, which means the first line of code can be skipped when iterating the attack. This attack is probably the fastest CSC. Additionally, the clflush instruction introduces fewer problems with the hardware prefetchers. Literature: [13]
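Concretely, the Flush+Flush probe only swaps the reload for another clflush and times that instead; a slow clflush means the line was cached, i.e. the victim touched it. Again the threshold is an assumed, machine-specific value:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_mfence, _mm_clflush */

/* Flush+flush probe [13]: the clflush is both the measurement and the
 * reset to a known state, so the attacker never takes a cache miss.
 * FF_THRESHOLD is an assumed, machine-specific cycle cutoff. */
#define FF_THRESHOLD 120   /* assumed; calibrate per CPU */

static int ff_probe(volatile uint8_t *shared)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_clflush((void *)shared);      /* probe and reset in one step */
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return (t1 - t0) > FF_THRESHOLD;  /* slow => line was cached */
}
```

Because the cached/uncached timing difference of clflush is much smaller than that of a load, this probe is noisier than Flush+Reload, which is the price of the stealth.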

Why is flush+flush Stealth

clflush does not cause cache misses! The victim still incurs cache misses due to the flushes, but this usually isn’t sufficient for the Flush+Reload detection I outlined in the previous slides to get decent detection rates without incurring a significant amount of false positives.

 

Detecting flush+flush

The first point is an assumption that we’re not talking about a cross-VM attack. My opinion is that cross-VM Flush+Flush attacks should be foiled by not using deduplication. It’s also an assumption that I need: in a cross-VM attack the attacker could use ring 0 to attack and thus bypass the CR4.TSD access violation. However, it is worth noting that even in this scenario the method would make Flush+Flush more difficult.

The other 3 points are just the technology I use to catch Flush+Flush.

Stage 1

This is actually a revamped version of my “Detecting flush+flush” blog post. After posting it, I had some private conversations in which the method was criticized for being too costly performance-wise. So I first tried to find a better performance counter. There was an event that actually mentioned CLFLUSH in the offcore counters; unfortunately, it was only available on a small subset of microarchitectures, which I deemed too small to be worthwhile. I tried to see if Intel had just changed the documentation, but it appears they really did revamp that event for other purposes. Then I played around with the CBo events in the cache system; though I could do better than cache-ref, it came at the cost of being gameable from an attacker’s viewpoint. So instead I decided on a two-stage approach – first detect the rdtsc bracket, then detect the clflush instruction. I had two reasons for this approach. The first is to deal with the performance impact of the second stage, which can be quite severe. The second is that I have a feeling we’ll see other instruction-timing attacks in the months to come, and this two-stage model maps well to defending against this kind of problem. The reason the two-stage approach works well is that the RDTSC instruction itself is rare, but RDTSC pairs spaced closely enough for an attacker not to drown in noise are so rare that I’ve not seen a single benign application trigger the detection in my testing.
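The stage-1 pairing rule can be sketched as pure decision logic. How RDTSC executions are observed (e.g. by trapping them via CR4.TSD, as discussed below) is platform work and not shown; WINDOW is an assumed tuning knob:

```c
#include <stdint.h>

/* Stage-1 trigger sketch: flag when two observed RDTSC executions fall
 * closer together than benign software plausibly produces. WINDOW is an
 * assumed cycle budget; tune against real workloads. */
#define WINDOW 5000ULL   /* cycles; assumed */

static uint64_t last_rdtsc;   /* timestamp of the previous trapped RDTSC */

static int stage1_trigger(uint64_t now)
{
    int hit = (now - last_rdtsc) < WINDOW;
    last_rdtsc = now;
    return hit;
}
```

Only when this trigger fires would the costly stage 2 – looking for the clflush between the two timestamps – be engaged.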

Problems?

Using CR4.TSD to make rdtsc privileged affects performance in two ways. First, it causes a very significant slowdown of the RDTSC instruction, since it must be emulated in an exception handler. However, RDTSC is rarely used – and in particular rarely used in a fashion where it makes up a significant part of the total execution time – so I think the performance penalty is acceptable. Much worse is that the RDTSC instruction becomes less accurate, which might cause issues. Short of profiling code, I’ve yet to come up with a benign case where such accuracy would be required. I may be wrong, though.

The detection obviously hinges strongly on RDTSC being used as the timer. I’ve yet to come up with a good alternative to rdtsc, but that doesn’t mean one doesn’t exist.

 

Omissions

The gory details left out are mostly related to the eviction policy and the complex addressing function, such as finding eviction sets and priming the cache. There are many suggestions for mitigation. Not sharing memory is very good but incomplete, as it doesn’t help against Evict+Time and Prime+Probe. There are different ways of adding noise or making the timer less accurate – all, in my opinion, fundamentally flawed, as noise and reduced timer accuracy can be defeated by the law of large numbers. Cache Allocation Technology can be used to protect against CSCs – here I’m not a fan, because it requires software to be re-authored to be protected and it has limited coverage. Finally, it’s worth mentioning writing branch-free, cache-set-aware code, which actually works but is difficult, and we’re unlikely to be able to demand that all affected software do so.