Negative Result: Reading Kernel Memory From User Mode

I were going to write an introduction about how important negative results can be. I didn’t. I assume you can figure out for yourself why that is and if not you got all the more reason to read this blog post. If you think it’s trivial why my result is negative, you definitely need to read the blog post.

The memory subsystem

I think most researchers would immediately think that reading kernel memory from an unprivileged process cannot work because “page tables”. That is what the Instruction Set Architecture or ISA says. But in order to understand why that was unsatisfactory to me we need to dig a bit deeper and I’ll start at the beginning.

 
When a software running on a Core requires memory it starts a so called “load” command. The load command is then processed in multiple stages until the data is found and returned or an error occurred. The Figure below shows a simplified version of this sub system.

Memory hierachy.png

Software including operating system uses virtual addressing to start a load (which is what a memory read is called inside the CPU). The first stage of processing is the L1 cache. The L1 cache is split between a data and an instruction cache. The L1 Cache is a so called VIPT or Virtually Indexed, Physically Tagged cache. This means the data can be looked up by directly using the virtual address of the load request.  This along with central position in the core makes the L1 incredibly fast. If the requested data was not found in the L1 cache the load must be passed down the cache hierarchy. This is the point where the page tables come into play. The page tables are used to translate the virtual address into a physical address. This is essentially how paging is enabled on x64. It is during this translation that privileges are checked. Once we have a physical address, the CPU can query the L2 cache and L3 cache in turn. Both are PITP caches (Physically Indexed, Physically Tagged) thus requiring the translation in the page tables before the lookup can be done. If no data was in the caches then the CPU will ask the memory controller to fetch the data in main memory. The latency of a data load in the L1 cache is around 5 clock cycles whereas a load from main memory is typically around 200 clock cycles. With the security check in the page tables and the L1 located before those we see already at this point that the because “page table” argument is too simple. That said – the intel manuals software developer’s manuals [2] states that the security settings is copied along the data into the L1 cache.

 

Speculative execution

In this and the next section I’ll outline the argumentation behind my theory, but only outline. For those a bit more interested it might be worth looking at my talk at HackPra from January [3] where I describe the pipeline in slightly more detail, though in a different context. Intel CPU’s are super scalar, pipelined CPU’s and they use speculative execution and that plays a big role in why I thought it may be possible read kernel memory from an unprivileged user mode process.   A very simplified overview of the pipeline can be seen in figure below.

 

pipeline.png

The pipeline starts with the instruction decoder. The instruction decoder reads bytes from memory, parse the buffer and output instructions.  Then it decodes instructions into micro ops. Micro ops are the building blocks of instructions. It is useful to think of x86 CPU’s as a Complex Instruction Set Computer (CISC) with a Reduced Instruction Set Computer (RISC) backend. It’s not entirely true, but it’s a useful way to think about it. The micro ops (the reduced instructions)  are queued into the reordering buffer. Here micro ops is kept it until all their dependencies have been resolved. Each cycle any micro ops with all dependencies resolved are scheduled on available execution units in first in ,first out order. With multiple specialized execution units many micro ops are executing at once and importantly due to dependencies and bottlenecks in execution units micro ops need not execute in the same order as the entered the reorder buffer. When a micro op is finished executing the result and exception status are added to the entry in the reorder buffer thus resolving dependencies of other micro ops. When all Micro ops belonging to a given instruction has been executed and have made it to the head of the reorder buffer queue they enter the retirement processing. Because the micro ops are at the head of the reorder buffer we can be sure that retirement is in the same order in which the micro ops were added to the queue. If an exception was flagged in the reorder buffer for a micro op being retired an interrupt is raised on the instruction to which the micro op belonged. Thus the interrupt is always raised on the instruction that caused even if the micro op that caused the interrupt was executed long before the entire instruction was done. A raised interrupt causes a flush of the pipeline, so that any micro op still in the reorder buffer is discarded and the instruction decoder reset. If no exception was thrown, the result is committed to the registers. This is essentially an implementation of Tomasulo’s algorithm [1] and allows multiple instructions to execute at the same time while maximizing resource use.

 

Abusing speculative execution

Imagine the following instruction executed in usermode

mov rax,[somekernelmodeaddress]

It will cause an interrupt when retired, but it’s not clear what happens between when the instruction is finished executing and the actual retirement. We know that once retired any information it may or may not have read is gone as it’ll never be committed to the architectural registers.. However, maybe we have a chance of seeing what was read if Intel relies perfectly on Tomasulo’s algorithm. Imagine the mov instruction sets the results in the reorder buffer as well as the flag that retirement should cause an interrupt and discard any data fetched. If this is the case we can execute additional code speculatively. Imagine the following code:

mov rax, [Somekerneladdress]

mov rbx, [someusermodeaddress]

If there are no dependencies both will execute simultaneous (there are two execution units for loads) and while the second will never get it’s result committed to the registers because it’ll be discarded when the firsts mov instruction causes an interrupt to be thrown. However, the second instruction will also execute speculatively and it may change the microarchitectural state of the CPU in a way that we can detect it. In this particular case the second mov instruction will load the

someusermodeaddress

into the cache hierarchy and we will be able to observe faster access time after structured exception handling took care of the exception. To make sure that the someusermodeaddress is not already in the cache hierarchy I can use the clflush instruction before starting the execution. Now only a single step is required for us to leak information about the kernel memory:

Mov rax, [somekerneladdress]

And rax, 1

Mov rbx,[rax+Someusermodeaddress]

 

If this is executed last two instructions are executed speculatively the address loaded differs depending on the value loaded from somekerneladdress and thus the address loaded into the cache may cause different cache lines to be loaded. This cache activity we can we can observe through a flush+reload cache attack.

The first problem we’re facing here is we must make sure the second and third instruction runs before the first one retires. We have a race condition we must win. How do we engineer that? We fill the reorder buffer with dependent instructions which use a different execution unit than the ones we need. Then we add the code listed above. The dependency forces the CPU to execute one micro op at a time of the filling instructions. Using an execution unit different than the one we used in the leaking code make sure that the CPU can speculatively execute these while working on the fill pattern.

The straight forward fill pattern I use is 300 add reg64, imm. I choose add rax, 0x141. I chose this because we have two execution units that are able to execute this instruction (Integer alus) and since they must execute sequentially one of these units will always be available to my leakage code (provided that another hardware thread in the same core isn’t mixing things up).

Since my kernel read mov instruction and my leaking mov instruction must run sequentially and that the data fetched by the leaking instruction cannot be in the cache the total execution time would be around 400 CLK’s if not the kernel address isn’t cached. This is a pretty steep cost given that an add rax, 0x141 cost around 3 CLK’s. For this reason, I see to it that the kernel address I’m accessing is loaded into the cache hierarchy. I use two different methods to ensure that. First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1. Gruss et al [4] concluded that prefetch instructions may load the cache despite of not having access rights. Further they showed that it’s possible for the page table traverse to abort and that would surely mean that I’m not getting a result. But having the data already in L1 will avoid this traverse.
 

All said and done there are a few assumptions I made about Intels implementation of Tomasulo’s algorithm:

1)    Speculative execution continues despite interrupt flag

2)    I can win the race condition between speculative execution and retirement

3)    I can load caches during speculative execution

4)    Data is provided despite interrupt flag

 

Evaluation

I ran the following tests on my i3-5005u Broadwell CPU.

I found no way I could separately test assumption 1,2 and 3. So I wrote up the code I outlined above and instead of code above I used the following code:

Mov rdi, [rdi];  where rdi is somekernelmodeaddress

mov rdi, [rdi + rcx+0]; where rcx is the Someusermodeaddress

Then I timed accessing the Someusermodeaddress after an exception handler dealt with the resulting exception. I did 1000 runs and sorted out  the 10 slowest out. First I did a run with the 2nd line above present with the value at the kernel mode address 0. I then did a second run with the second line commented out. This allows me test if I have a side channel inside the speculative execution. The results is summarized in the below histogram (mean and variance respectively: mean= 71.22 std = 3.31; mean= 286.17 std =  53.22). So obviously I have a statistic significant  covert channel inside the speculative execution.

 
In a second step I made sure that the kernel address I was trying to read had the value 4096.  Now if I’m actually reading kernelmode data the second line will fetch a cache line in the next page. Thus I would expect a slow access to Someusermodeaddress. As the speculative fetch should access a cache line exactly one page further. I selected the offset 4096 to avoid effects due to hardware prefetchers I could’ve added the size of a cache line. Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed. I double checked by accessing the cache line I wanted to access and indeed that address was not loaded into the cache either. Consequently it seems likely that intel do process the illegal reading of kernel mode memory, but do not copy the result into the reorder buffer. So at this point my experiment is failed and thus the negative result. Being really curious I added an add rdi,4096 instruction after the first line in the test code and could verify that this code was indeed executed speculatively and the result was side channeled out.

 

Pandoras box

While I  did set out to read kernel mode without privileges and that produced a negative result, I do feel like I opened a Pandora’s box. The thing is there was two positive results in my tests. The first is that Intel’s implementation of Tomasulo’s algorithm is not side channel safe. Consequently we have access to results of speculative execution despite the results never being committed. Secondly, my results demonstrate that speculative execution does indeed continue despite violations of the isolation between kernel mode and user mode.

This is truly bad news for the security. First it gives microarchitecture side channel attacks additional leverage – we can deduct not only information from is actually executed but also from what is speculatively executed. It also seems likely that we can influence what is speculative executed and what is not through influencing caches like the BTB see Dmitry Evtyushkin and Dmitry Ponomarev [5] for instance.It thus add another possibility to increase the expressiveness of microarchitecture side channel attacks and thus potentially allow an attacker even more leverage through the CPU. This of cause makes writing constant time code even more complex and thus it is definitely bad news.

Also it draws into doubt mitigations that rely on retirement of instructions. I cannot say I know how far that stretches, but my immediate guess would be that vmexit’s is handled on instruction retirement. Further we see that speculative execution does not consistently abide by isolation mechanism, thus it’s a haunting question what we can actually do with speculative execution.

 

Literature

[1] Intel® 64 and IA-32 Architectures Software Developer Manuals. Intel. https://software.intel.com/en-us/articles/intel-sdm

 

[2] Tomasulo, Robert M. (Jan 1967). “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”. IBM Journal of Research and Development. IBM. 11 (1): 25–33. ISSN 0018-8646. doi:10.1147/rd.111.0025.

 

[3] Fogh, Anders. “Covert shotgun: Automatically finding covert channels in SMT”

https://www.youtube.com/watch?v=oVmPQCT5VkY.

 

[4] Gruss, Daniel, et al. “Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR.”

Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

 

[5] Evtyushkin D, Ponomarev D, Abu-Ghazaleh N. Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR. InProceedings of 49th International Symposium on Microarchitecture (MICRO) 2016.

Statistics and Infosec

 

Introduction

 

The work by John Ioannidis [1] among others started the “reproduction crisis”. The “reproduction crisis” is the phrase used to describe that in many branches of science researchers have been unable to reproduce classic results which form the basis for our understanding of the world. Since the basic principle of modern science is about standing on the shoulders of giants, it is incredibly important that sound methodology is used to insure that our foundation is solid. For now the reproduction crisis has hardly reached information security and I hope it will stay that way. But unfortunately the information security scene is full of issues that make it vulnerable to reproduction problems. Top among them is methodology and the lack of reproduction studies. A huge amount of papers is accepted as true, but never reproduced and many papers suffer from methodologic problems.

 

In this blog post I will argue that among information security papers, very basic statistics could improve the process and reproducibility. Statistics is important as soon as observed data are noisy. Noisy data are everywhere even in computer science. Users are non-deterministic, unobserved variables, manufacturing variance, temperature and other environmental influences all effect our measurements. This of course does not mean that we cannot say anything meaningful about computers – it just means we need to embrace statistics and so far the information security science has not been very successfull at that. This purpose of this blog post is to suggest a baby step in the right direction – using sample size, mean and standard deviation as a mean of improving the expose on noisy data. I intentionally keep the theory short – I encourage everybody to read a real statistics book. Also, with the intention to provoke, I picked two examples of papers where the statistics could have been done better.

 

Statistics, Sample Size, Mean, and Standard Deviation 101

During a lecture my first statistics professor said that he would automatically flunk any of us in the future if we reported on noisy data without mentioning 3 data points on them. These data points are the sample size (N), the mean, and the standard deviation. The first reason for bringing these three data points is that it is as close as one gets to a standard in science . It is accepted in other sciences ranging from physics over medicine to economics. Having a standard on reporting noisy data allows us to compare studies easily and gives us an easy way to reproduce the findings of other people. Unfortunately, this standard appears to not have arrived in full force in information security yet.

But it is not only about a standard. The thing is: using these three values is not just a historical accident. There is sound theory why we want these three values. Imagine a process generating our data – usually called the data generating process or d.g.p. If the d.g.p. produces observations where the noise is independent of other observations we say the d.g.p. produces independent, identically distributed observations or i.i.d. If observations are not i.i.d. we cannot treat any variation as noise and consequently we need further analysis before we can draw any conclusion based on the data. Hence, i.i.d. is often assumed or considered a good approximation. And very often there are good reasons to do so.

With i.i.d. data the arithmetic mean of our observations will almost surely converge towards the mean of the d.g.p as the number of observations increases. This is called the law of large numbers. If one rolls a dice a thousand times and gets an arithmetic mean of 3.0, it is an indication that the dice is not a fair dice. This explains why we wish to report the arithmetic mean. If the number of samples is sufficiently high, then the arithmetic mean is “close” to the real mean of the data generating process.

Also with i.i.d. data the central limit theorem almost always applies. The central limit theorem says that we can expect the mean to tend towards normally distributed as the sample size increase. The normal distribution is characterized by only two parameters, its mean and its standard deviation. It allows us to check hypothesis about the data. The most basic tests can even be made on the back of an envelope. This allows to put a probability on the dice used in the previous example not being a fair dice – that we are not just observing noise.

Why we should include the sample size should be pretty obvious by now. Both the law of large numbers and the central limit theorem hinge on convergence over the sample size. Thus if we have a small sample size, we are unable to rely on any conclusions. That should not surprise you. Imagine throwing a dice 3 times getting 1,2, and 4 – you would be a fool to insist that your data show that 6 rarely happens. If you have thrown the dice a million times and do not see any 6’s the dice almost surely is not fair and the aforementioned conclusion would be pertinent.

In most first (or second if applied with rigor) semester statistics books, you can find a discussion on estimating how many samples you need based on the estimated standard deviation from a pilot sample or prior knowledge on the d.g.p. That of course can be turned around to evaluate if a study had sufficient sample size – yet another reason why these three numbers are valuable.

In short three small numbers in a 14 page paper can provide deep insight into the soundness of your methodology as well as making your paper easily comparable and reproducible.If  you are in doubt, you need to bring these 3 numbers. One word of caution though: I think these numbers are necessary but they may not be sufficient to describe the data. Also you will have no quarrel with me if I can easily calculate the standard deviation from reported data . For example report variance instead of standard deviation or omitting the standard deviation when reporting on proportions (the Bernoulli distribution is described only by its first moment) is just fine.

 

Case study 1: A Software Approach to Defeating Side Channels in Last-Level Caches

I will start out with a pretty good paper that could have been better if standard deviation and mean had been mentioned: “A Software Approach to Defeating Side Channels in Last-Level Caches “ by Zhou et al. [2]. The authors develop CacheBar – a method to mitigate cache side channel attacks. For Flush+Reload attacks they do a copy-on-access implementation to avoid sharing memory which effectively kills Flush+Reload as an attack vector. I shall concentrate on Zhou et al.s [2] protection against Prime+Probe. For this purpose, Zhou et al [2] use an overcommitted page coloring scheme. Using overcommitted page coloring allows for a significantly more flexible memory allocation and thus less performance penalty than the classic page coloring scheme. However, an attacker will remain able to Prime a cache set to a certain extend. Consequently, the scheme does not defeat Prime+Probe, but it does add noise. The paper analyzes this noise by first grouping observations of defender demand on a cache set into “None, one, few, some, lots and most” categories. They subsequently train a Baysian classifier on 500.000 Prime+Probe trials and present the classification result in a custom figure in matrix form. I have no objections thus far. If the authors think that this method of describing the noise provides insights then that is fine with me. I am however missing the standard deviation and mean – they do offer the sample size. Why do I think the paper would have been better if they had printed these two values?

  • The categories are arbitrary: To compare the results to other papers with noise (say noisy timers) we would have to categorize in exactly the same fashion.
  • Having reported means we would immediately be able to tell if the noise is biased. Bias would show up in the figures as misprediction but an attacker could adjust for bias.
  • With standard deviation we could calculate on a back of an envelope how many observations an attacker would need to actually successfully exfiltrate the information she is after and thus evaluate if the mitigation will make attacks impractical in different scenarios. For example, you can get lots of observations of encryption, but you cannot ask a user for his password 10000 times.
  • Reproducing a study will in the presence of noisy data always have slight differences. When the black and white matrix figures are close enough to consider reproduced? For means and standard deviation that question has been resolved.

That all said, I do not worry about the conclusion of this paper. The sample size is sufficient for the results to hold up and the paper is generally well written. I think this paper is a significant contribution to our knowledge on defending against cache side channel attacks.

 

Case study 2: CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks

Caveat, Apology and More Information

Before I start with my analysis of the statistics in this paper, I should mention that I had given my off-the-cuff comments on this paper before and I generally stand by what I wrote on that occasion. You will find my comments published in Security Week[6]. I did get one thing wrong in my off-the-cuff remarks: I wrote Brasser et al [3] draws heavily on Pessl et al. [4]. Brasser et al. do not use the Dram-mapping function used by Pessl et al.[4], as I initially thought they did. Instead, they use an ad-hoc specified approximate mapping function. Also it is very important to mention that I am referring to an early version of the paper on archiv substance to my previous off-the-cuff remarks or what I write in this blog post. As my critique of the paper is relatively harsh, I emailed the authors ahead of posting the blog post. As of publishing the blog post, I have not yet received an answer.

Getting dirty with it

“CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks” by Brasser et al. [3] implements two mitigations for the rowhammer problem: B-Catt and G-Catt. G-Catt uses memory partitioning, so that the kernel is not co-located in the same banks as user mode memory which prevents code hammering the kernel space from user mode. I am not convinced that G-Catt cannot be circumvented but that is not the subject of this blog post. I will instead focus on B-Catt. Kim et al [5] suggested not using vulnerable memory addresses and B-Catt implements this idea in a boot loader, rather elegantly using int 15h, subfunction 0xe820 which provides the operating system a list of memory regions it should avoid using. If the computer is not using any vulnerable addresses, one cannot flip bits with row hammer and the system is safe. Obviously, not using some of the physical memory comes at a cost for the end user. Kim et al. [5]concluded that “However, the first/second approaches are ineffective when every row in the module is a victim row (Section 6.3)”. Contrary, Brasser et al.[3] concludes: “…we demonstrate that it is an efficient and practical solution that effectively prevents rowhammer attacks as a short-term solution”. So, the big question is who it gets right here. The key issue is how many pages contain bit flips. Kim et. Al[5] do not provide numbers on pages, but Brasser et al do. They evaluate 3 test systems to support their conclusion: “However, our evaluation (Section VI-A2) suggests that only a fraction of rows are vulnerable in practice.”

And this is where my gripe with the statistics in this paper is. Sometimes you can get away with a tiny sample size. That is when you have prior evidence that the sample variance is either irrelevant or negligible. Relevance of the variance here is given. The prior evidence of a small variance is not present. In fact, I will argue quite the contrary in later.

But first let me get rid of the terrible loose language of “practical and efficient”. My 15 year experience as a professional software developer tells me that software needs to work as advertised on at least 99% of the systems. That is in my opinion a conservative value. Further assume that 5% memory overhead is acceptable for B-Catt users. Notice here that I defined practical as a fraction of systems that must be running with an acceptable overhead. The alternative would have been defining practical as a maximal acceptable average overhead. I consider the fraction approach the most applicable approach in most real world scenarios – it certainly is easier to do back-of-envelope calculations with, which I will be doing for the remainder of this blog post.

Now we can put the sample size of 3 into perspective. Imagine that in reality 10% of systems have more pages with bit flips than are acceptable. With a sample  size of 3 that gives us a 73% (0,93) chance that our evaluation will be “suggesting” that we are indeed practical. Being wrong 73% of the time is not a suggestion of being right. We need more data than 3 observations to assume practicality. Obviously, you may argue that my 10% is pulled out of a hat – and it is indeed. It is a conservative number, but not unrealistic.

This leads us to the question of how much data do we need? The answer to that question depends on the variance in the data, something that we do not know. So, we can either make an a-priori guess based on domain knowledge or do a pilot sample to get an idea. My first semester statistics book [6] states: “We recommend taking at least 20 observations in a pilot sample”. Pp. 419. So, the author’s samples do not even qualify as a pilot sample.

So we are left with turning to domain knowledge. Kim et. al [5] in my opinion has the best available data. They sample 129 modules but do not mention the number of affected pages. They do however mention that 3 modules have more than 40% affected rows. On a two channel skylake system (to my knowledge the best case) we can have 4 rows per page and assume all bit flips occur so that the smallest number of pages are affected – then we end up with 3 modules with at least 10% overhead out of 129. Thus no lesss 2,3% of modules will have too much overhead, according to my definition, in Kim’s samples. This certainly constitutes a reason for skepticism on Brasser et al’s[3] claims.

Kim et al. [5] unfortunately do not calculate a mean or standard deviation – they do report their complete data, so that I could calculate it, but instead I will use a bit of brute force to get somewhere. We know that rowhammer is fairly bad because 6 of 129 samples have more than 1 million bit flips. Assuming that each cell is equally likely to flip that gives us a 14,8% probability that a given page is vulnerable for any of these 6 modules. This assumption has weak support in Kim et. al. [5] Consequently, at least 4.6% of the modules are not practical under my definition and my fairly conservative assumptions. With a finite base of modules (200) and under the assumption that modules are equally common in the wild and 129 observations, we can calculate a 95% confidence interval. This confidence interval tells us that between 2.44% – 6.76% of the assumed 200 modules in the population are vulnerable. Obviously, it is not realistic to assume only 200 modules in existence, but if there are more the confidence interval would be even larger. Thus, under these assumption we must dismiss that B-Catt is practical with 95% confidence. It looks like B-Catt is in trouble.

The assumptions above are of cause very restrictive and thus more research is required. What would this research look like? A pilot sample with at least 20 samples would be a good start. Alternatively, we could use my above assumptions to calculate a sample size. If we do this we end up with the number 80.

Obviously, everything here rides on my definition of practicality and efficiency. My gut feeling is that B-Catt won’t turn out to be practical if we had a reasonable sample. Nevertheless I would love to see data on the issue. While I think my assumptions are fair for a mainstream product, there might be special cases where B-Catt may shine. For a data center that can search for a ram product with few bit flips B-Catt may very well be a solution.

 

Literature

[1] Ioannidis, John PA. “Why most published research findings are false.” PLos med 2.8

(2005): e124.

[2] Zhou, Ziqiao, Michael K. Reiter, and Yinqian Zhang. “A software approach to

defeating side channels in last-level caches.” Proceedings of the 2016 ACM SIGSAC

Conference on Computer and Communications Security. ACM, 2016.

[3] Brasser, Ferdinand, et al. “CAn’t Touch This: Practical and Generic Software-only

Defenses Against Rowhammer Attacks.” arXiv preprint arXiv:1611.08396 (2016).

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM addressing for cross-cpu

attacks.” Proceedings of the 25th USENIX Security Symposium. 2016.

[5] Kim, Yoongu, et al. “Flipping bits in memory without accessing them: An

experimental study of DRAM disturbance errors.” ACM SIGARCH Computer

Architecture News. Vol. 42. No. 3. IEEE Press, 2014.

[6] Berry, Donald A., and Bernard William Lindgren. Statistics: Theory and methods.

Duxbury Resource Center,

[7] Kovacs, Eduard. “Researchers Propose Software Mitigations for Rowhammer

Attacks” http://www.securityweek.com/researchers-propose-software-mitigations-

rowhammer-attacks

New cache architecture on Intel I9 and Skylake server: An initial assessment

 

Intel has introduced the new I9 CPU which is seen as HEDT (High-End-DeskTop) product. The micro architecture is in many respects shared with the new Skylake server micro architecture.I f history is a guide technology introduced in this segment slowly trickles down to more budget friend desktops. From a Micro architecture point of view it seems that several things about these CPU’s will force changes on micro architectural attacks – especially in the memory subsystem. In this blog post I’ll do a short overview on some of relevant changes and the effects they’ll may have on micro architecture attacks. Since I don’t own or have access to an actual I9 processor or Skylake server processor this blog post is just conjecture.

 

Resume of the “old” cache hierachy

The major changes over earlier process from a micro architecture point of view is that the cache system has received a significant overhaul. Current Intel CPUs have a 3 level cache hierarchy – two very small L1 caches, one for data, one for instructions. The second level cache (L2 or Mid Latency Cache) is somewhat larger. L1 data, L1 code and L2  part of each core and private to the core. Finally, Intel CPU’s had a huge 3rd level cache (usually called L3 or  largest latency cache) shared between all cores. The 3rd level cache is subdivided into slices that are logically connected to a core. To effectively share this cache, Intel connected them on a ring bus called the Quick Path Interconnect. Further the 3rd level cache was an inclusive cache, which means that anything that is anything cached in L1 or L2 must also be cached in L3.

 

Changes

Some of the important changes that has been announced in the Intel Software Optimization Manuals [1]  are:

–    A focus on a high number of cores in the CPU (up to 18 in the HEDT models)

–    A reduced over all cache size per core (compared to similar older models)

–    A very significant increase in the size of the L2 (factor of 4)

–    Doubled the bandwidth of L2, while only slightly increasing the latency

–    Slightly more than offset by a reduction of the shared L3.

–    Reorganized  L3 cache to be a non-inclusive cache

–    Replaces the QPI with a mesh-style bus.

Why does these changes make sense?

Increasing the size of L2 on the cost of L3 makes sense as the L2 is much faster than L3 for applications – one can only assume that making the L3 helps reduce die size and cost. Also the increase in size of the L2 caches reduces the marginal utility of the L3. Finally  as the probability of cache set contention rises with the number of cores, it becomes advantageous to make a larger part of the total cache private. Cache contention in L3 is a problem Intel has battled before. With Haswell they introduced Cache Allocation Technology (CAT) to allow software to control cache usage of the L3 to deal with contention.

The number of cores is probably also the reason why Intel dropped the QPI ring buffer design. There is a penalty for having memory served from another core’s slices. On a ring bus this penalty is proportional to how far the cores are a part on the ring. With more cores, this penalty increases. Moving to a more flexible system seems sensible from this vantage point.

Having an inclusive L3 makes cache coherency easier and faster to manage. However, an inclusive cache comes with a cost as the same data are loaded in multiple caches. The relative loss of total cache storage space is exactly the ratio of the (L2 +L1) to L3 sizes. Previous this ratio has been around 1:10 (depending on actual CPU), but multiplying the L2 size by 4 and making the L3 a tiny bit smaller the ratio is now about 1:1.5. Thus the making the L3 cache non-inclusive is very essential to performance. At this point it’s important to notice that Intel uses the wording “non-inclusive”. This term is not well defined. The opposite of inclusive is exclusive meaning the content of L3 cannot be loaded in L1 and L2. I think Intel would have used the defined term exclusive if the cache really where exclusive. Thus,   it is probably fair to assume that non-inclusive means that data may or may not be cached in L1, L2 and L3 at the same time, but exactly how this is governed is important. Unfortunately there is no information available on this.

It’s worth noting that many of these changes has been tested and developed by Intel for the Knights landing micro architecture. Knights landing is a high throughput micro architecture, sold in relative small amounts. Thus it’s likely that many features developed for this CPU will end up being trickled down.  It’ll be interesting to see if Intel plans to trickle it down into laptop/small desktops or use different cache designs for different classes of CPUs.

 

Effects

Cache side channel attacks

This new cache layout is likely to have profound effect on cache side channel attacks. I think Flush+Reload will work even without the non-inclusive cache. The flush primitive is based on the CLFlush instruction which is part of the instruction set architecture (ISA). Intel has been very reluctant in the past to change the ISA and therefore my estimate is that flushing will work as always. I think the reload primitive will remain active – I find it likely that an uncached load will also load stuff into the shared L3. Also it’s likely that the QPI replacement bus can be used to fetch data from private L2 caches similar to AMD’s cross CPU transmission. This will give Flush+reload a nice flush+transfer flavor, but it’ll still work. For more on Invalidate (Flush)+Transfer see [2]. Since the L3 cache must be filled in someway we can fairly certain that at least one of these things are true, if not both. The latter being my guess.

The big change are likely to be related to evict and prime primitives. Assuming that cache contention between cores was a major reason for redesigning the cache, it’s unlikely that one can can load stuff into another core’s private hierarchy. Thus, the prime and evict primitives for  cross core attacks.  However,both are likely to work decently within a core (Hyper threading or scheduling on same core).

While the ISA behavior of CLFLush is almost certain to remain unchanged, the micro architecture below it will see significant changes. With QPI gone using the flush+flush attack by Gruss et al. [3] to find out how many hubs on the ring bus away you are from a particular slice almost certainly won’t work. This does not mean that you won’t find a side channel here in CLFlush, buses are typically bandwidth limited thus being an obvious source of congestion – without an inclusive L3 the bus bandwidth might even be easier to consume. Also the Flush+Flush attack as a Flush+reload replacement, is likely to produce a different timing behavior in the new micro architecture. My guess upfront is that a timing difference and thus a side channel remains.

Also affected by the non-inclusiveness of the L3 is row buffer side channel attacks such as those presented by Pessl. et al.[4]  Without an effective eviction cross core attacks may be severely stifled. The ability to reverse engineer the DRAM complex mapping function is likely to remain unchanged as it hinges not on eviction, but the CLFlush instruction’s ISA behavior.

With the CLFlush instruction and evict likely still working on local cores, row hammer will  remain effective. But in some scenarios, indirect effects of  micro architecture changes may break specific attacks such as that of  Sarani Bhattacharya, Debdeep Mukhopadhyay [5]. The reason is that such attacks rely on information leakage in the caches, that becomes more difficult to leverage as described above.

 

Future

While the changes to the cache makes sense with the significant number of cores in the affected systems, it seems unlikely that the changes will trickle down to notebooks and laptops. With only two cores the existing cache design seems sensible. Thus we are likely to see a two tier cache design moving on.

Conclusion

Having a non-inclusive L3 cache is significantly more secure from a side channel perspective than an inclusive in cross core scenarios. This opens up for defending these attacks, by isolating different security domains on different cores potentially dynamically. While flush+reload is likely to be unaffected, this attack is also the easiest to thwart in real life scenarios as avoiding shared memory cross security domains is an available and effective countermeasure. Lots of new research is required to gauge the security of these micro architecture changes.

 

Literature

[1] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. July 2017. https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[2] Irazoqui, Gorka, Thomas Eisenbarth, and Berk Sunar. “Cross processor cache attacks.” Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security. ACM, 2016.a

[3] Gruss, Daniel, et al. “Flush+ Flush: a fast and stealthy cache attack.” Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2016. 279-299.

[4] Pessl, Peter, et al. “DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks.” USENIX Security Symposium. 2016.

[5] Sarani Bhattacharya, Debdeep Mukhopadhyay: “Curious case of Rowhammer: Flipping Secret Exponent Bits using Timing Analysis”. http://eprint.iacr.org/2016/618.pdf