Thursday, March 20, 2014

Reversing EMET's EAF (and a couple of curious findings...)

EMET is a very useful tool that allows a user to configure security protections against some common, well-known attack vectors. In this blog entry I will focus on EAF, pointing out some issues that affect the current implementation. EAF stands for Export Address Filtering and, as the name suggests, this protection controls access to the Export Table of a couple of major system DLLs, in order to make it harder for an attacker to obtain API addresses when the request comes from outside an executable module (e.g. from a shellcode running on the stack or on the heap).

Here is a snapshot of the EMET configuration interface, where you can see all the available protections (including EAF):




EMET uses the shim engine to inject its module into every protected process: if you inspect a process on which EMET is active (e.g. with Process Explorer), you will notice the presence of EMET.dll, which means that at least one protection is enabled for that process. So EMET operates from inside the process in order to enable its protections, but despite this "invasive" approach I haven't noticed any performance or functionality problems. Some compatibility issues do exist (given the tricky nature of some protections), but they are well documented for the most common software.

Let's start focusing on EAF itself. First, EMET protects EMET.dll by calling the GetModuleHandleEx API: by specifying the flags GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS and GET_MODULE_HANDLE_EX_FLAG_PIN together with an address inside EMET.dll itself, the DLL will stay loaded until the process is terminated (no matter how many times FreeLibrary is called).
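
A minimal sketch of this pinning trick (my own illustration, not EMET's actual code) could look like this:

#include <Windows.h>

// Pin the current DLL in memory, as EMET.dll does for itself: with the PIN
// flag the module's reference count can no longer drop to zero, so the DLL
// stays loaded until the process terminates.
static void PinCurrentModule(void)
{
     HMODULE hSelf = NULL;
     GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                        GET_MODULE_HANDLE_EX_FLAG_PIN,
                        (LPCWSTR)&PinCurrentModule,   // any address inside this module
                        &hSelf);
}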

Then, EMET reads the Export Table of kernel32.dll and of ntdll.dll (the two DLLs being protected) and, in both cases, saves the address of the AddressOfFunctions field (in the IMAGE_EXPORT_DIRECTORY structure), which holds the location of the table of exported API addresses. Having done that, EMET installs a global exception handler by calling the AddVectoredExceptionHandler API; this handler will be used to filter the exceptions that occur when a hardware breakpoint is hit, and I will describe it later.
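
Here is a rough illustration of these two setup steps (my own sketch, not EMET's code; the helper names are made up): locating the AddressOfFunctions field to be guarded and registering the vectored handler.

#include <Windows.h>

LONG CALLBACK EafVectoredHandler(PEXCEPTION_POINTERS ExceptionInfo);   // described later

// Returns the address *of* the AddressOfFunctions field inside the module's
// IMAGE_EXPORT_DIRECTORY: this location (not its value) is what the hardware
// breakpoints will watch.
static DWORD *GetAddressOfFunctionsField(HMODULE hMod)
{
     BYTE *base = (BYTE *)hMod;
     IMAGE_NT_HEADERS *pPE = (IMAGE_NT_HEADERS *)(base + ((IMAGE_DOS_HEADER *)base)->e_lfanew);
     DWORD expRva = pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
     IMAGE_EXPORT_DIRECTORY *pExpDir = (IMAGE_EXPORT_DIRECTORY *)(base + expRva);
     return &pExpDir->AddressOfFunctions;
}

static void SetupEaf(void)
{
     DWORD *pK32 = GetAddressOfFunctionsField(GetModuleHandleW(L"kernel32.dll"));
     DWORD *pNtdll = GetAddressOfFunctionsField(GetModuleHandleW(L"ntdll.dll"));
     AddVectoredExceptionHandler(1, EafVectoredHandler);   // 1 = call this handler first
     // pK32 and pNtdll will later be written into DR0/DR1 for every thread.
     (void)pK32; (void)pNtdll;
}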

Now EMET proceeds to activate the protection by splitting the work between two threads.

The main one uses the CreateToolhelp32Snapshot/Thread32First/Thread32Next APIs to get a list of all the running threads of the current process and saves them in an array:

.text:0005486D                 push    0FFFFFFFFh      ; dwMilliseconds
.text:0005486F                 push    array_mutex     ; hHandle
.text:00054875                 call    ds:WaitForSingleObject
.text:0005487B                 mov     eax, thread_count
.text:00054880                 cmp     eax, 256
.text:00054885                 jnb     short loc_54897
.text:00054887                 mov     ecx, [ebp+thread_id]
.text:0005488A                 mov     tid_array[eax*4], ecx
.text:00054891                 inc     thread_count
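
For reference, the Toolhelp-based enumeration that feeds this array might look roughly like the following C sketch (my own approximation, not EMET's code):

#include <Windows.h>
#include <tlhelp32.h>

// Enumerate the threads of the current process and pass each thread id to a
// callback (which, in EMET's case, would append it to the array above).
static void EnumerateOwnThreads(void (*onThread)(DWORD tid))
{
     HANDLE hSnap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
     THREADENTRY32 te = { sizeof(te) };
     DWORD pid = GetCurrentProcessId();

     if (hSnap == INVALID_HANDLE_VALUE)
          return;
     if (Thread32First(hSnap, &te))
     {
          do
          {
               if (te.th32OwnerProcessID == pid)   // the snapshot covers every process
                    onThread(te.th32ThreadID);
          } while (Thread32Next(hSnap, &te));
     }
     CloseHandle(hSnap);
}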


The second one retrieves all the threads from the array and activates the hardware breakpoints on them in order to protect the AddressOfFunctions fields (one per DLL) mentioned above.
This array has a hardcoded size of 256 DWORDs, but don't be misled: it is only a temporary buffer where new threads are added until they are processed, and then removed, by the protector thread.
Moreover, EMET uses a mutex (actually saved as the first element of the array) to synchronize access to the thread list, so that in practice all the newly added threads are processed before the array fills up with 256 of them:


.text:00054906             Protector_Loop:
.text:00054906                 push    100
.text:00054908                 call    ds:Sleep
.text:0005490E                 push    0FFFFFFFFh
.text:00054910                 push    array_mutex
.text:00054916                 call    ds:WaitForSingleObject
.text:0005491C                 mov     ebx, thread_count
.text:00054922                 test    ebx, ebx
.text:00054924                 jz      short loc_5498B
  ...


Still, there is a curious race condition: a certain amount of time passes between the creation of a thread, its insertion into the array, and the activation of EAF by the protector thread. Because of this delay, newly created threads (including the main application thread) are not protected by EAF during the initial part of their execution.



The Windows scheduler lets each thread run only for a limited time slice, after which execution is passed to other threads. In this way, in most scenarios, a new thread (including the main one) will run for some time before execution eventually yields to the protector thread, which then activates the EAF protection. But what if this thread runs vulnerable code before the scheduler allows the protector thread to execute? Is that possible? Well, in theory it is and, actually, this is also how I discovered the race condition in the first place.

I created a little application that accesses the AddressOfFunctions field of the kernel32.dll Export Table from a shellcode loaded outside executable modules (in the heap), prints it and then quits. I also activated EAF from the EMET tool. My application should have crashed, but instead it worked without any problem and I couldn't understand why. Moreover, I made the application print the hardware debug registers, and I noticed that the hardware breakpoints were never set. Debugging EMET.dll I discovered the race condition: so I added a Sleep() in the entry point of my test application to give the EAF protector thread time to run and, lo and behold, my application crashed as expected when the AddressOfFunctions field was read from the malicious shellcode.
The same holds if I run an analogous test on newly created threads, not just the main one: there is a small window of vulnerability at the beginning of every thread, but it's very unlikely that an attacker will ever take advantage of it.

Here is the source code of my test application:

 #include <Windows.h>  
 #include <stdio.h>  
 #include <stdlib.h>  
   
 DWORD getApiAddress(void)  
 {  
      DWORD KernelImagebase, *pNames, *pAddresses, pCreateFile = 0;  
      IMAGE_DOS_HEADER *pMZ;  
      IMAGE_NT_HEADERS *pPE;  
      IMAGE_EXPORT_DIRECTORY *pExpDir;  
      CHAR *currentName;  
   
      KernelImagebase = (DWORD)LoadLibrary(L"Kernel32.dll");  
   
      pMZ = (IMAGE_DOS_HEADER*)KernelImagebase;  
      pPE = (IMAGE_NT_HEADERS*)(KernelImagebase + pMZ->e_lfanew);  
      pExpDir = (IMAGE_EXPORT_DIRECTORY*)(KernelImagebase + pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);  
   
      pNames = (DWORD*)(KernelImagebase + pExpDir->AddressOfNames);  
      pAddresses = (DWORD*)(KernelImagebase + pExpDir->AddressOfFunctions);  
      for(int i = 0; i < pExpDir->NumberOfNames; i++)  
      {  
           currentName = (CHAR*)(KernelImagebase + pNames[i]);  
           if(lstrcmpA(currentName, "CreateFileA") == 0)  
           {  
                pCreateFile = (DWORD)(KernelImagebase + pAddresses[i]);  
           }  
      }  
   
      return pCreateFile;  
 }  
   
 void main(void)  
 {  
      DWORD apiAddress;  
   
      Sleep(2000);          // this delay will fix the race condition!  
   
      // print the debug registers  
   
      CONTEXT myContext;  
      memset(&myContext, 0, sizeof(myContext));  
      myContext.ContextFlags = CONTEXT_ALL;  
      HANDLE hThread = GetCurrentThread();  
      if(!GetThreadContext(hThread, &myContext)){  
           printf("cannot get thread context \n");  
      }  
      printf("main D0: %08x, D1: %08x, D2: %08x, D3: %08x\n",   
           myContext.Dr0, myContext.Dr1, myContext.Dr2, myContext.Dr3);  
   
      // test1: checking the export table of kernel32.dll from this executable module  
   
      apiAddress = getApiAddress();  
      printf("Test1 CreateFileA function: %08x \n", apiAddress);  
   
      // test2: checking the export table of kernel32.dll from the heap  
   
      DWORD functionSize, pMain, pgetApiAddress;  
   
      pMain = DWORD(&main);  
      pgetApiAddress = DWORD(&getApiAddress);  
      functionSize = pMain - pgetApiAddress;  
   
      BYTE *shellcode = (BYTE*)malloc(functionSize);  
      memcpy(shellcode, (BYTE*)pgetApiAddress, functionSize);  
   
      __asm  
      {  
           mov  ebx, shellcode  
           call  ebx  
           mov  apiAddress, eax  
      }  
   
      free(shellcode);  
   
      printf("Test2 CreateFileA function: %08x \n", apiAddress);  
   
      getchar();  
 }  


Note: when you compile this code (I used Visual Studio), you must disable all the optimizations to avoid changes to the code layout, and also remove DEP from the linker options.

"test1" retrives the address of the CreateFileA API from inside the executable module; "test2" does the same from the heap.

If you don't add the Sleep(2000) in the main() function, you will get this output:

main D0: 00000000, D1: 00000000, D2: 00000000, D3: 00000000
Test1 CreateFileA function: 7649bde6
Test2 CreateFileA function: 7649bde6


Notice how the debug registers are all zero and both tests ran successfully.
If instead you keep the Sleep(2000) in the code, you will get:

main D0: 7651fa5c, D1: 77e40204, D2: 00000000, D3: 00000000
Test1 CreateFileA function: 7649bde6


As you can see, the debug registers are set and the EAF protection is active, therefore the application crashes when running the second test:



I think that a better use of synchronization objects could avoid this race condition: for instance, implementing these routines with a critical section and two events would probably have been a safer alternative.


In this implementation, the main thread and every additional thread that is created add themselves to the thread array (processed by the protector thread). The code that does this runs inside a critical section: this way, if multiple threads are created, only one at a time executes the code that adds it to the thread array. Also, a critical section is a cheap synchronization object compared to the mutex used in the EMET implementation.
The protector thread waits on "event 1", an event object: it is thus not wasting CPU cycles looping continuously, like the current EMET implementation does; it only wakes up and uses the CPU when a new thread is created. In fact, a new thread adds itself to the thread array and then signals "event 1", waking up the protector thread. The new thread then stops and waits for "event 2". Meanwhile, the protector thread has time to process the thread array and, because of the structure of the code, it knows that no other thread is modifying it. Once EAF is activated, the protector thread signals "event 2" and goes back to waiting for "event 1". The signaled "event 2" wakes up the new thread, which then continues its normal execution.
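
A rough sketch of this design could look like the following (names and helpers are illustrative; this is my own code, not EMET's):

#include <Windows.h>

static CRITICAL_SECTION g_cs;      // guards the registration code
static HANDLE g_evtNewThread;      // "event 1": a new thread is waiting for protection
static HANDLE g_evtProtected;      // "event 2": the protector thread has applied EAF
static DWORD  g_newThreadId;       // a single slot replaces the 256-entry array

static void InitEafSync(void)
{
     InitializeCriticalSection(&g_cs);
     g_evtNewThread = CreateEventW(NULL, FALSE, FALSE, NULL);   // auto-reset
     g_evtProtected = CreateEventW(NULL, FALSE, FALSE, NULL);   // auto-reset
}

// Called at the very beginning of the main thread and of every new thread.
static void RegisterThreadForEaf(void)
{
     EnterCriticalSection(&g_cs);                     // one registration at a time
     g_newThreadId = GetCurrentThreadId();
     SetEvent(g_evtNewThread);                        // wake up the protector thread
     WaitForSingleObject(g_evtProtected, INFINITE);   // block until EAF is active
     LeaveCriticalSection(&g_cs);
}

// Protector thread: sleeps on "event 1" instead of polling every 100 ms.
static DWORD WINAPI ProtectorThread(LPVOID unused)
{
     UNREFERENCED_PARAMETER(unused);
     for (;;)
     {
          WaitForSingleObject(g_evtNewThread, INFINITE);
          // ... open g_newThreadId, suspend it, program DR0/DR1/DR7, resume it ...
          SetEvent(g_evtProtected);                   // let the registered thread continue
     }
}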

This implementation has several advantages with respect to the one from EMET:

  • The protector thread only uses resources when it has to.
  • Only one thread at a time registers itself, which removes the need for an array in the first place: the code could just use a single variable, avoiding the arbitrary size of 256 and the rare but possible condition of the array filling up before the protector thread runs.
  • The new thread is guaranteed to be protected when it reaches the user code, avoiding the small window of vulnerability described in EMET's implementation.
I have not tested this code, but it should work and should not suffer from deadlocks. This could also be implemented in other ways, but you get the point I'm trying to make: you can use proper synchronization to make the code cleaner, more efficient and more elegant.

Now let's go back to the second thread: how exactly is EAF implemented? Recall that hardware breakpoints are set through the CPU debug registers.
EMET walks every entry in the thread list, then opens and suspends each thread in turn in order to modify its context with the SetThreadContext API.

As you can see from the image above, the addresses of the AddressOfFunctions fields of the Export Tables of kernel32.dll and ntdll.dll are loaded into the DR0 and DR1 registers, while the appropriate flags are set in DR7.

These flags are:
  • L0, L1 used to activate the local breakpoints (meaning that they only work in the current thread);
  • LE used for backward compatibility reasons;
  • R/W0, R/W1 used to indicate if the breakpoint is set on read, write, or execute operations;
  • LEN0, LEN1 used to specify the size of the data on which the breakpoint acts.

In short: L0, L1 and LE are set to 1 (meaning these flags are enabled); R/W0 and R/W1 are set to 11 binary (a breakpoint on data reads or writes); LEN0 and LEN1 are set to 11 binary (4-byte breakpoints).
When these modifications are done, the thread is resumed and the EAF protection becomes active.
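
A minimal sketch of how such breakpoints can be set on another thread follows (my own illustration, not EMET's actual code); the thread handle must have THREAD_GET_CONTEXT, THREAD_SET_CONTEXT and THREAD_SUSPEND_RESUME access:

#include <Windows.h>

static BOOL SetEafBreakpoints(HANDLE hThread, DWORD_PTR addrK32, DWORD_PTR addrNtdll)
{
     CONTEXT ctx;
     BOOL ok = FALSE;

     if (SuspendThread(hThread) == (DWORD)-1)
          return FALSE;

     ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
     if (GetThreadContext(hThread, &ctx))
     {
          ctx.Dr0 = addrK32;                       // &AddressOfFunctions of kernel32.dll
          ctx.Dr1 = addrNtdll;                     // &AddressOfFunctions of ntdll.dll
          // DR7: L0, L1, LE set; R/W0 = R/W1 = 11b (break on data read/write);
          // LEN0 = LEN1 = 11b (4-byte breakpoints).
          ctx.Dr7 = 0x1 | 0x4 | 0x100
                  | (0x3 << 16) | (0x3 << 18)      // R/W0, LEN0
                  | (0x3 << 20) | (0x3 << 22);     // R/W1, LEN1
          ok = SetThreadContext(hThread, &ctx);
     }
     ResumeThread(hThread);
     return ok;
}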

If you are interested in digging into the debug registers and how Windows handles them, I suggest you read this article by Alex Ionescu.

At this point the description is almost complete; the only missing piece is the function installed as the exception handler. Let's briefly recall that a function passed as a vectored exception handler must have the following prototype:
LONG CALLBACK VectoredHandler(
  _In_  PEXCEPTION_POINTERS ExceptionInfo
);

In particular, EMET accesses ExceptionInfo->ExceptionRecord->ExceptionCode to filter the exception itself, making sure that it's a Single Step one (remember that when a hardware breakpoint is hit, the generated exception is of type Single Step). If it is, EMET disables all the active hardware breakpoints (that is, it clears the L0 and L1 flags in DR7).

Then it reads the context at the time the exception occurred through ExceptionInfo->ContextRecord and checks the four lowest bits of DR6 (B0 to B3): these bits indicate which hardware breakpoint condition was met when the Single Step exception was raised (distinguishing it from the exceptions generated when the Trap Flag is set).
However, I'm quite sure that there's a little bug in this check:

.text:000546C4                 test    byte ptr [eax+CONTEXT.Dr6], 11h ; bug! 11h should be 3
.text:000546C8                 jz      short not_handled
.text:000546CA                 push    [eax+CONTEXT._Eip] ; reg_eip
.text:000546D0                 call    is_in_module
.text:000546D5                 test    eax, eax
.text:000546D7                 jnz     short not_handled
.text:000546D9                 push    edi
.text:000546DA                 push    1
.text:000546DC                 call    report_protection
.text:000546E1                 cmp     status_exploitaction, 1
.text:000546E8                 pop     ecx
.text:000546E9                 pop     ecx
.text:000546EA                 jnz     short not_handled
.text:000546EC                 push    1
.text:000546EE                 push    STATUS_STACK_BUFFER_OVERRUN
.text:000546F3                 push    dword ptr [edi+4]
.text:000546F6                 call    report_error_and_terminate
.text:000546FB not_handled:
  ...


In fact, EMET tests DR6 against 11 hex, which is 10001 in binary, corresponding to B0 and the undocumented 5th bit that, according to Intel's manuals, is always set to 1. I believe this is a typo, and that the correct mask to test was 11 in binary (that is, 3 hex), i.e. both B0 and B1.
This is not a serious issue, because the test always passes (that 5th bit is always 1), so hits on the DR1 breakpoint are handled anyway; it's just useless to let EMET handle a breakpoint that was not actually hit.

If one of the two hardware breakpoints was hit when the exception occurred (which will always appear to be the case because of the buggy TEST instruction), EMET checks the value of the EIP register at that time (through ExceptionInfo->ContextRecord->Eip) and verifies, using GetModuleHandleEx, whether the instruction that caused the Single Step exception belongs to an executable module or not. If it doesn't, the error is logged and, if "status_exploitaction" is set (this variable corresponds to the "Stop on exploit/Audit only" option available from EMET's settings panel), a STATUS_STACK_BUFFER_OVERRUN is reported (through ExceptionInfo->ExceptionRecord->ExceptionCode), the exception is left unhandled and the process is terminated. In all other cases (that is, if neither of the two bits in DR6 is set, if the instruction reported in EIP did belong to an executable module, or if "status_exploitaction" isn't set), EMET clears all the bits in DR6 and activates the L0 and L1 flags in DR7 again, letting execution resume as if nothing had happened.
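
Putting it all together, a rough reconstruction of the handler's logic (my own sketch under the assumptions above, not EMET's actual code) could look like this:

#include <Windows.h>

LONG CALLBACK EafVectoredHandler(PEXCEPTION_POINTERS ExceptionInfo)
{
     EXCEPTION_RECORD *rec = ExceptionInfo->ExceptionRecord;
     CONTEXT *ctx = ExceptionInfo->ContextRecord;
     HMODULE hMod;

     if (rec->ExceptionCode != EXCEPTION_SINGLE_STEP)
          return EXCEPTION_CONTINUE_SEARCH;

     ctx->Dr7 &= ~(0x1 | 0x4);                     // temporarily clear L0 and L1

     if ((ctx->Dr6 & 0x3) &&                       // B0 or B1: one of our breakpoints was hit
         !GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                             GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
                             (LPCWSTR)(ULONG_PTR)ctx->Eip, &hMod))
     {
          // EIP lies outside any loaded module: log the event and, in
          // "Stop on exploit" mode, terminate the process.
          // ... report / terminate with STATUS_STACK_BUFFER_OVERRUN ...
     }

     ctx->Dr6 = 0;                                 // clear the breakpoint status bits
     ctx->Dr7 |= 0x1 | 0x4;                        // re-enable L0 and L1
     return EXCEPTION_CONTINUE_EXECUTION;
}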

Our journey through the EAF implementation is now over, but I would like to briefly discuss a couple of methods to bypass it. As stated by Microsoft, EAF wasn't meant as a definitive protection against unwanted access to API addresses, but rather as an obstacle for existing shellcodes.

One simple way to obtain this information, without any need to access the Export Table, is to use the Import Table instead. In particular, you can parse the Import Table of a DLL that *imports* the desired API from kernel32.dll or ntdll.dll, and look at the OriginalFirstThunk and FirstThunk fields of the IMAGE_IMPORT_DESCRIPTOR structure. For example, user32.dll is loaded in almost every running process, and it imports both the LoadLibrary and GetProcAddress APIs, which are commonly used in shellcodes to get the addresses of other APIs.
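
As an illustration, a routine that resolves an API through the IAT of an already loaded module could look like the following (my own sketch, not production shellcode; a real payload would locate user32.dll through the PEB rather than calling GetModuleHandle):

#include <Windows.h>

static FARPROC FindImportedApi(HMODULE hMod, const char *dllName, const char *apiName)
{
     BYTE *base = (BYTE *)hMod;
     IMAGE_NT_HEADERS *pPE = (IMAGE_NT_HEADERS *)(base + ((IMAGE_DOS_HEADER *)base)->e_lfanew);
     DWORD impRva = pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress;
     IMAGE_IMPORT_DESCRIPTOR *pImp = (IMAGE_IMPORT_DESCRIPTOR *)(base + impRva);

     for (; pImp->Name != 0; pImp++)
     {
          if (lstrcmpiA((char *)(base + pImp->Name), dllName) != 0)
               continue;
          // OriginalFirstThunk points to the imported names, FirstThunk to the
          // resolved addresses (the IAT): no Export Table access is needed.
          IMAGE_THUNK_DATA *pNames = (IMAGE_THUNK_DATA *)(base + pImp->OriginalFirstThunk);
          IMAGE_THUNK_DATA *pAddrs = (IMAGE_THUNK_DATA *)(base + pImp->FirstThunk);
          for (; pNames->u1.AddressOfData != 0; pNames++, pAddrs++)
          {
               if (IMAGE_SNAP_BY_ORDINAL(pNames->u1.Ordinal))
                    continue;                      // imported by ordinal, no name to compare
               IMAGE_IMPORT_BY_NAME *pIbn = (IMAGE_IMPORT_BY_NAME *)(base + pNames->u1.AddressOfData);
               if (lstrcmpA((char *)pIbn->Name, apiName) == 0)
                    return (FARPROC)pAddrs->u1.Function;
          }
     }
     return NULL;
}

// Example: FindImportedApi(GetModuleHandleW(L"user32.dll"), "kernel32.dll", "GetProcAddress");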

Another method to bypass EAF is to use a specially crafted ROP gadget just to retrieve the AddressOfFunctions value. In this way, since you are reading the Export Table from a gadget that lies within an executable module, EMET won't detect anything suspicious, and you can then find the addresses of all the needed APIs. Of course, EMET performs some security checks against ROP too, but since we need only one gadget it's not too difficult to find one that eludes the protection (or else, you may want to use a JOP gadget). For example, a shellcode may parse the Export Table of a module in order to find the pointer to the AddressOfFunctions field, put this pointer in the EAX register and then call a code gadget that does the following:

MOV       EAX, [EAX]
RET


This gadget is very short, requiring only three bytes of opcodes (8B 00 C3), so it should be very easy to find inside most executable modules.

These are just two simple ideas that came to mind; of course, they are nothing new, and surely you can find other ways to implement the trick. Moreover, these two methods assume that you have already gotten rid of DEP and ASLR, which are the real pain when writing an exploit.

Note: this analysis was originally written in September 2013 for version 4.0, but it still holds for the current version, 4.1.