Our 2015 USENIX Security Symposium paper titled “Raccoon: Closing Digital Side Channels Using Obfuscated Execution” presents a software defense against a class of for side channels on x86 processors. The key contribution of that paper is a technique to close a broad range of side channels (which we call digital side channels) using a single solution. For details about Raccoon’s operation, see here.
I have three reasons to write this post: (a) to present the key (new) insights that guided our design of Raccoon (and our subsequent research), (b) to explain nuances between design choices (that could not be addressed in the published paper), and (c) to clarify a possible point of confusion about the comparison between Raccoon and GhostRider.
The key insights that enables Raccoon to close digital side channels using optimizing compilers and modern processor architectures (like x86) are:
The first insight enables writing secure code, while the second insight enables safe optimization of secure code. Raccoon builds upon these two insights to generate high-performance secure code on x86 processors. Let us look at a couple of examples.
cmov instruction is well documented in Intel’s Software Developer Manuals. This allows us to verify the impact of executing the instruction on the microarchitectural state (flags, registers, exceptions, etc.). Through manual testing, we can verify the impact of this instruction on performance counter results and timing. The
cmov instruction (which has been previously analyzed and used in security-sensitive open source applications) forms the core building block of Raccoon, making it possible to implement Raccoon on x86 processors.
Decades of prior research has produced hardware and software optimizations. However, the traditional approach for executing secure code has been to disable optimizations (in the compiler, runtime system, or the processor), effectively discarding the developments of the past decades. It seems to me that the blame is incorrectly attributed. Instead of separating optimizations from secrets, Raccoon separates optimization policies from secrets.
Let”s consider an example to clarify the difference. Caches use the trace of memory accesses to decide the locations that will reside close to the processor core. One way to prevent the access trace from leaking secret information is to disable the cache. However, the same result is obtained if the memory access trace is made independent of the secrets. The optimization policy (access trace) is now decoupled from the secrets without having to disable the optimization (cache).
If we focus on optimization policies rather than the optimization, we can not only continue to enable hardware and software optimizations (thus continuing to benefit non-secret code), but we can also bring performance benefits to secure code. A secure code that load values from the processor cache will be orders of magnitude faster than the secure code that has to load values from the DRAM.
Similar arguments could be made for other microarchitectural components (e.g. branch predictors). In fact, this can be a guiding policy for converging security and performance — an idea that we’re currently pursuing.
ORAM controllers haven’t made their way into x86 processors, so Raccoon was left to either stream over the entire array or to use a modified Path ORAM (implemented by manually applying Raccoon's principles to a non-secure Path ORAM code for x86 processors). The question that I would like to address here is: can we make some intelligent comparisons between streaming and software Path ORAM?
Certainly, streaming makes use of caches and prefetchers due to the predictable address trace, while ORAM’s random access trace hurts caches and prefetchers. But there is one other subtle difference: array accesses with constant index (e.g.
array). There is no point streaming over the entire array for such accesses — the adversary already knows the index from looking at the code. However, if you were to use ORAM, a direct fetch from the ORAM block containing the 10th integer in
array breaks the invariant — that the adversary never knows the placement of ORAM blocks in the tree. As a result, using ORAM forces the obfuscation code to use expensive techniques to access memory that otherwise used constant indexes.
It is unclear, though, whether ORAM will win over streaming asymptotically. As data sizes become large, the data will grow beyond the size of the cache (which happens for Raccoon’s data sizes as well), but the prefetchers are expected to bring data into the cache before it is referenced. Certainly, the number of memory locations accessed using streaming outweighs those accessed using ORAM, but caches and prefetchers introduce a non-linear effect in execution time.
As we note in our paper, we could not precisely mimic the GhostRider processor using our simulator. Some of the differences between GhostRider and our simulator makes GhostRider look better. For instance, we could not model fixed-latency instructions and AVX vector instructions in our simulator. Similarly, our simulator did not include a scratchpad memory. GhostRider relies on scratchpad memory for improved performance but the lack of scratchpad support in the simulator made it impossible to perform an apples-to-apples comparison. Manual analysis of scratchpad usage for thousands of iterations in half a dozen benchmark programs is a truly onerous task. In the end, we chose to model GhostRider as closely as possible within the given limitations — with variable-latency instructions, using perfect caches for non-ORAM accesses, without using scratchpad memory, and using branch prediction.