The work by John Ioannidis  among others started the “reproduction crisis”. The “reproduction crisis” is the phrase used to describe that in many branches of science researchers have been unable to reproduce classic results which form the basis for our understanding of the world. Since the basic principle of modern science is about standing on the shoulders of giants, it is incredibly important that sound methodology is used to insure that our foundation is solid. For now the reproduction crisis has hardly reached information security and I hope it will stay that way. But unfortunately the information security scene is full of issues that make it vulnerable to reproduction problems. Top among them is methodology and the lack of reproduction studies. A huge amount of papers is accepted as true, but never reproduced and many papers suffer from methodologic problems.
In this blog post I will argue that among information security papers, very basic statistics could improve the process and reproducibility. Statistics is important as soon as observed data are noisy. Noisy data are everywhere even in computer science. Users are non-deterministic, unobserved variables, manufacturing variance, temperature and other environmental influences all effect our measurements. This of course does not mean that we cannot say anything meaningful about computers – it just means we need to embrace statistics and so far the information security science has not been very successfull at that. This purpose of this blog post is to suggest a baby step in the right direction – using sample size, mean and standard deviation as a mean of improving the expose on noisy data. I intentionally keep the theory short – I encourage everybody to read a real statistics book. Also, with the intention to provoke, I picked two examples of papers where the statistics could have been done better.
Statistics, Sample Size, Mean, and Standard Deviation 101
During a lecture my first statistics professor said that he would automatically flunk any of us in the future if we reported on noisy data without mentioning 3 data points on them. These data points are the sample size (N), the mean, and the standard deviation. The first reason for bringing these three data points is that it is as close as one gets to a standard in science . It is accepted in other sciences ranging from physics over medicine to economics. Having a standard on reporting noisy data allows us to compare studies easily and gives us an easy way to reproduce the findings of other people. Unfortunately, this standard appears to not have arrived in full force in information security yet.
But it is not only about a standard. The thing is: using these three values is not just a historical accident. There is sound theory why we want these three values. Imagine a process generating our data – usually called the data generating process or d.g.p. If the d.g.p. produces observations where the noise is independent of other observations we say the d.g.p. produces independent, identically distributed observations or i.i.d. If observations are not i.i.d. we cannot treat any variation as noise and consequently we need further analysis before we can draw any conclusion based on the data. Hence, i.i.d. is often assumed or considered a good approximation. And very often there are good reasons to do so.
With i.i.d. data the arithmetic mean of our observations will almost surely converge towards the mean of the d.g.p as the number of observations increases. This is called the law of large numbers. If one rolls a dice a thousand times and gets an arithmetic mean of 3.0, it is an indication that the dice is not a fair dice. This explains why we wish to report the arithmetic mean. If the number of samples is sufficiently high, then the arithmetic mean is “close” to the real mean of the data generating process.
Also with i.i.d. data the central limit theorem almost always applies. The central limit theorem says that we can expect the mean to tend towards normally distributed as the sample size increase. The normal distribution is characterized by only two parameters, its mean and its standard deviation. It allows us to check hypothesis about the data. The most basic tests can even be made on the back of an envelope. This allows to put a probability on the dice used in the previous example not being a fair dice – that we are not just observing noise.
Why we should include the sample size should be pretty obvious by now. Both the law of large numbers and the central limit theorem hinge on convergence over the sample size. Thus if we have a small sample size, we are unable to rely on any conclusions. That should not surprise you. Imagine throwing a dice 3 times getting 1,2, and 4 – you would be a fool to insist that your data show that 6 rarely happens. If you have thrown the dice a million times and do not see any 6’s the dice almost surely is not fair and the aforementioned conclusion would be pertinent.
In most first (or second if applied with rigor) semester statistics books, you can find a discussion on estimating how many samples you need based on the estimated standard deviation from a pilot sample or prior knowledge on the d.g.p. That of course can be turned around to evaluate if a study had sufficient sample size – yet another reason why these three numbers are valuable.
In short three small numbers in a 14 page paper can provide deep insight into the soundness of your methodology as well as making your paper easily comparable and reproducible.If you are in doubt, you need to bring these 3 numbers. One word of caution though: I think these numbers are necessary but they may not be sufficient to describe the data. Also you will have no quarrel with me if I can easily calculate the standard deviation from reported data . For example report variance instead of standard deviation or omitting the standard deviation when reporting on proportions (the Bernoulli distribution is described only by its first moment) is just fine.
Case study 1: A Software Approach to Defeating Side Channels in Last-Level Caches
I will start out with a pretty good paper that could have been better if standard deviation and mean had been mentioned: “A Software Approach to Defeating Side Channels in Last-Level Caches “ by Zhou et al. . The authors develop CacheBar – a method to mitigate cache side channel attacks. For Flush+Reload attacks they do a copy-on-access implementation to avoid sharing memory which effectively kills Flush+Reload as an attack vector. I shall concentrate on Zhou et al.s  protection against Prime+Probe. For this purpose, Zhou et al  use an overcommitted page coloring scheme. Using overcommitted page coloring allows for a significantly more flexible memory allocation and thus less performance penalty than the classic page coloring scheme. However, an attacker will remain able to Prime a cache set to a certain extend. Consequently, the scheme does not defeat Prime+Probe, but it does add noise. The paper analyzes this noise by first grouping observations of defender demand on a cache set into “None, one, few, some, lots and most” categories. They subsequently train a Baysian classifier on 500.000 Prime+Probe trials and present the classification result in a custom figure in matrix form. I have no objections thus far. If the authors think that this method of describing the noise provides insights then that is fine with me. I am however missing the standard deviation and mean – they do offer the sample size. Why do I think the paper would have been better if they had printed these two values?
- The categories are arbitrary: To compare the results to other papers with noise (say noisy timers) we would have to categorize in exactly the same fashion.
- Having reported means we would immediately be able to tell if the noise is biased. Bias would show up in the figures as misprediction but an attacker could adjust for bias.
- With standard deviation we could calculate on a back of an envelope how many observations an attacker would need to actually successfully exfiltrate the information she is after and thus evaluate if the mitigation will make attacks impractical in different scenarios. For example, you can get lots of observations of encryption, but you cannot ask a user for his password 10000 times.
- Reproducing a study will in the presence of noisy data always have slight differences. When the black and white matrix figures are close enough to consider reproduced? For means and standard deviation that question has been resolved.
That all said, I do not worry about the conclusion of this paper. The sample size is sufficient for the results to hold up and the paper is generally well written. I think this paper is a significant contribution to our knowledge on defending against cache side channel attacks.
Case study 2: CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks
Caveat, Apology and More Information
Before I start with my analysis of the statistics in this paper, I should mention that I had given my off-the-cuff comments on this paper before and I generally stand by what I wrote on that occasion. You will find my comments published in Security Week. I did get one thing wrong in my off-the-cuff remarks: I wrote Brasser et al  draws heavily on Pessl et al. . Brasser et al. do not use the Dram-mapping function used by Pessl et al., as I initially thought they did. Instead, they use an ad-hoc specified approximate mapping function. Also it is very important to mention that I am referring to an early version of the paper on archiv substance to my previous off-the-cuff remarks or what I write in this blog post. As my critique of the paper is relatively harsh, I emailed the authors ahead of posting the blog post. As of publishing the blog post, I have not yet received an answer.
Getting dirty with it
“CAn’t Touch This: Practical and Generic Software-only Defenses Against Rowhammer Attacks” by Brasser et al.  implements two mitigations for the rowhammer problem: B-Catt and G-Catt. G-Catt uses memory partitioning, so that the kernel is not co-located in the same banks as user mode memory which prevents code hammering the kernel space from user mode. I am not convinced that G-Catt cannot be circumvented but that is not the subject of this blog post. I will instead focus on B-Catt. Kim et al  suggested not using vulnerable memory addresses and B-Catt implements this idea in a boot loader, rather elegantly using int 15h, subfunction 0xe820 which provides the operating system a list of memory regions it should avoid using. If the computer is not using any vulnerable addresses, one cannot flip bits with row hammer and the system is safe. Obviously, not using some of the physical memory comes at a cost for the end user. Kim et al. concluded that “However, the first/second approaches are ineffective when every row in the module is a victim row (Section 6.3)”. Contrary, Brasser et al. concludes: “…we demonstrate that it is an efficient and practical solution that effectively prevents rowhammer attacks as a short-term solution”. So, the big question is who it gets right here. The key issue is how many pages contain bit flips. Kim et. Al do not provide numbers on pages, but Brasser et al do. They evaluate 3 test systems to support their conclusion: “However, our evaluation (Section VI-A2) suggests that only a fraction of rows are vulnerable in practice.”
And this is where my gripe with the statistics in this paper is. Sometimes you can get away with a tiny sample size. That is when you have prior evidence that the sample variance is either irrelevant or negligible. Relevance of the variance here is given. The prior evidence of a small variance is not present. In fact, I will argue quite the contrary in later.
But first let me get rid of the terrible loose language of “practical and efficient”. My 15 year experience as a professional software developer tells me that software needs to work as advertised on at least 99% of the systems. That is in my opinion a conservative value. Further assume that 5% memory overhead is acceptable for B-Catt users. Notice here that I defined practical as a fraction of systems that must be running with an acceptable overhead. The alternative would have been defining practical as a maximal acceptable average overhead. I consider the fraction approach the most applicable approach in most real world scenarios – it certainly is easier to do back-of-envelope calculations with, which I will be doing for the remainder of this blog post.
Now we can put the sample size of 3 into perspective. Imagine that in reality 10% of systems have more pages with bit flips than are acceptable. With a sample size of 3 that gives us a 73% (0,93) chance that our evaluation will be “suggesting” that we are indeed practical. Being wrong 73% of the time is not a suggestion of being right. We need more data than 3 observations to assume practicality. Obviously, you may argue that my 10% is pulled out of a hat – and it is indeed. It is a conservative number, but not unrealistic.
This leads us to the question of how much data do we need? The answer to that question depends on the variance in the data, something that we do not know. So, we can either make an a-priori guess based on domain knowledge or do a pilot sample to get an idea. My first semester statistics book  states: “We recommend taking at least 20 observations in a pilot sample”. Pp. 419. So, the author’s samples do not even qualify as a pilot sample.
So we are left with turning to domain knowledge. Kim et. al  in my opinion has the best available data. They sample 129 modules but do not mention the number of affected pages. They do however mention that 3 modules have more than 40% affected rows. On a two channel skylake system (to my knowledge the best case) we can have 4 rows per page and assume all bit flips occur so that the smallest number of pages are affected – then we end up with 3 modules with at least 10% overhead out of 129. Thus no lesss 2,3% of modules will have too much overhead, according to my definition, in Kim’s samples. This certainly constitutes a reason for skepticism on Brasser et al’s claims.
Kim et al.  unfortunately do not calculate a mean or standard deviation – they do report their complete data, so that I could calculate it, but instead I will use a bit of brute force to get somewhere. We know that rowhammer is fairly bad because 6 of 129 samples have more than 1 million bit flips. Assuming that each cell is equally likely to flip that gives us a 14,8% probability that a given page is vulnerable for any of these 6 modules. This assumption has weak support in Kim et. al.  Consequently, at least 4.6% of the modules are not practical under my definition and my fairly conservative assumptions. With a finite base of modules (200) and under the assumption that modules are equally common in the wild and 129 observations, we can calculate a 95% confidence interval. This confidence interval tells us that between 2.44% – 6.76% of the assumed 200 modules in the population are vulnerable. Obviously, it is not realistic to assume only 200 modules in existence, but if there are more the confidence interval would be even larger. Thus, under these assumption we must dismiss that B-Catt is practical with 95% confidence. It looks like B-Catt is in trouble.
The assumptions above are of cause very restrictive and thus more research is required. What would this research look like? A pilot sample with at least 20 samples would be a good start. Alternatively, we could use my above assumptions to calculate a sample size. If we do this we end up with the number 80.
Obviously, everything here rides on my definition of practicality and efficiency. My gut feeling is that B-Catt won’t turn out to be practical if we had a reasonable sample. Nevertheless I would love to see data on the issue. While I think my assumptions are fair for a mainstream product, there might be special cases where B-Catt may shine. For a data center that can search for a ram product with few bit flips B-Catt may very well be a solution.
 Ioannidis, John PA. “Why most published research findings are false.” PLos med 2.8
 Zhou, Ziqiao, Michael K. Reiter, and Yinqian Zhang. “A software approach to
defeating side channels in last-level caches.” Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security. ACM, 2016.
 Brasser, Ferdinand, et al. “CAn’t Touch This: Practical and Generic Software-only
Defenses Against Rowhammer Attacks.” arXiv preprint arXiv:1611.08396 (2016).
 Pessl, Peter, et al. “DRAMA: Exploiting DRAM addressing for cross-cpu
attacks.” Proceedings of the 25th USENIX Security Symposium. 2016.
 Kim, Yoongu, et al. “Flipping bits in memory without accessing them: An
experimental study of DRAM disturbance errors.” ACM SIGARCH Computer
Architecture News. Vol. 42. No. 3. IEEE Press, 2014.
 Berry, Donald A., and Bernard William Lindgren. Statistics: Theory and methods.
Duxbury Resource Center,
 Kovacs, Eduard. “Researchers Propose Software Mitigations for Rowhammer