Here is a snapshot of the EMET configuration interface, where you can see all the available protections (including EAF):
EMET uses the shim engine to inject its module into all the protected processes: if you inspect a process on which EMET is active (e.g. with Process Explorer), you will notice the presence of EMET.dll, which means that at least one protection is active for that process. So, EMET operates from inside the process in order to enable its protections but, despite this "invasive" approach, I haven't noticed any performance or functionality problems. Some compatibility problems do exist (given the tricky nature of some protections), but they are well documented for the most common software.
Let's start focusing on EAF itself. First, EMET protects EMET.dll from being unloaded by calling the GetModuleHandleEx API: when it is called with the GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS and GET_MODULE_HANDLE_EX_FLAG_PIN flags and an address inside EMET.dll itself, the DLL stays loaded until the process is terminated (no matter how many times FreeLibrary is called).
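For reference, such a pinning call looks roughly like the following (a minimal sketch, not EMET's actual code; PinSelf is just a placeholder for any function inside the module to be pinned):

#include <Windows.h>

void PinSelf(void)
{
    HMODULE hSelf = NULL;
    // pass an address inside this module (the function itself will do) together with
    // the FROM_ADDRESS and PIN flags: the module can then no longer be unloaded by FreeLibrary
    GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_PIN,
        (LPCTSTR)&PinSelf, &hSelf);
}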
Then, EMET reads the Export Table of kernel32.dll and of ntdll.dll (the two DLLs being protected) and, in both cases, saves the location of the AddressOfFunctions field (from the IMAGE_EXPORT_DIRECTORY structure), which holds the RVA of the array containing the addresses of all the exported APIs. Having done that, EMET installs a global Exception Handler by calling the AddVectoredExceptionHandler API, which will be used to filter all the exceptions that occur when a hardware breakpoint is hit. I will describe this Exception Handler routine later.
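The registration itself is straightforward (a minimal sketch with a placeholder handler name; the actual handler logic is analyzed later in the article):

#include <Windows.h>

// hypothetical name; the real logic of EMET's handler is described later
LONG CALLBACK EafHandler(PEXCEPTION_POINTERS ExceptionInfo)
{
    // placeholder body: pass every exception on to the next handler
    return EXCEPTION_CONTINUE_SEARCH;
}

void InstallEafHandler(void)
{
    // 1 = put the handler at the head of the vectored handler list,
    // so it sees the Single Step exceptions before anyone else
    AddVectoredExceptionHandler(1, EafHandler);
}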
Now EMET proceeds to activate the protection, splitting the work between two threads.
The main one uses the CreateToolhelp32Snapshot/Thread32First/Thread32Next APIs to get a list of all the running threads of the current process and saves their IDs in an array:
.text:0005486D push 0FFFFFFFFh ; dwMilliseconds
.text:0005486F push array_mutex ; hHandle
.text:00054875 call ds:WaitForSingleObject
.text:0005487B mov eax, thread_count
.text:00054880 cmp eax, 256
.text:00054885 jnb short loc_54897
.text:00054887 mov ecx, [ebp+thread_id]
.text:0005488A mov tid_array[eax*4], ecx
.text:00054891 inc thread_count
The second one retrieves all the threads from the array and activates the hardware breakpoints on them in order to protect the AddressOfFunctions fields (one per DLL) mentioned above.
This array has a hardcoded size of 256 DWORDs, but don't worry: it is only a temporary buffer where new threads are added until they are processed, and then removed, by the protector thread.
Moreover, EMET uses a mutex (actually saved as the first element of the array) to synchronize the access to the thread list, thus ensuring that all the newly added threads are processed before the array fills up with 256 of them:
.text:00054906 Protector_Loop:
.text:00054906 push 100
.text:00054908 call ds:Sleep
.text:0005490E push 0FFFFFFFFh
.text:00054910 push array_mutex
.text:00054916 call ds:WaitForSingleObject
.text:0005491C mov ebx, thread_count
.text:00054922 test ebx, ebx
.text:00054924 jz short loc_5498B
...
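Putting the two listings together, the logic of the two threads is roughly the following (a simplified C sketch of what the disassembly shows, with names of my own; SetEafBreakpoints stands for the routine, described later, that actually sets the debug registers):

#include <Windows.h>
#include <TlHelp32.h>

#define MAX_PENDING_THREADS 256

// hypothetical helper that sets DR0/DR1/DR7 on one thread (shown later in the article)
void SetEafBreakpoints(DWORD threadId);

HANDLE array_mutex;                      // stored as the first element of EMET's structure
DWORD  tid_array[MAX_PENDING_THREADS];   // pending thread IDs
DWORD  thread_count;                     // number of valid entries in tid_array

// thread 1: enumerate the threads of the current process and queue them for protection
void WatcherLoop(void)
{
    HANDLE hSnap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    if(Thread32First(hSnap, &te))
    {
        do
        {
            if(te.th32OwnerProcessID != GetCurrentProcessId())
                continue;
            WaitForSingleObject(array_mutex, INFINITE);
            if(thread_count < MAX_PENDING_THREADS)
                tid_array[thread_count++] = te.th32ThreadID;
            ReleaseMutex(array_mutex);
        } while(Thread32Next(hSnap, &te));
    }
    CloseHandle(hSnap);
}

// thread 2 (the "protector"): periodically drain the pending list and activate EAF
void ProtectorLoop(void)
{
    for(;;)
    {
        Sleep(100);
        WaitForSingleObject(array_mutex, INFINITE);
        while(thread_count > 0)
        {
            thread_count--;
            SetEafBreakpoints(tid_array[thread_count]);
        }
        ReleaseMutex(array_mutex);
    }
}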
Still, there is a curious race condition: a certain amount of time passes between the creation of a thread, its insertion into the array and the activation of EAF by the protector thread. Due to this delay, newly created threads (including the main application thread) are not protected by EAF during the initial part of their execution.
The Windows scheduler lets each thread run only for a limited time slice, after which execution is passed to other threads. In this way, in most scenarios, a new thread (including the main one) will run for some time before execution eventually yields to the protector thread, which will then activate the EAF protection. But what if this thread runs vulnerable code before the scheduler allows the protector thread to run? Is that possible? Well, in theory it is and, actually, this is also how I discovered the race condition in the first place.
I created a little application that accesses the AddressOfFunctions field of the kernel32.dll Export Table from a shellcode loaded outside executable modules (in the heap), prints it and then quits. I also activated EAF from the EMET tool. My application should have crashed, but instead it worked without any problem and I couldn't understand why. Moreover, I made my application print the hardware debug registers, and I noticed that the hardware breakpoints were never set. Debugging EMET.dll I discovered the race condition: so, I added a Sleep() in the entry point of my test application to give the EAF protector thread the time to run, and lo and behold, my application crashed as expected when the AddressOfFunctions field was read from the malicious shellcode.
The same holds if I run an analogous test on newly created threads, not just the main one: there is a small window of vulnerability at the beginning of every thread, but it's very unlikely that an attacker will ever take advantage of it.
Here is the source code of my test application:
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h> // for malloc/free
#include <string.h> // for memset/memcpy
DWORD getApiAddress(void)
{
    DWORD KernelImagebase, *pNames, *pAddresses, pCreateFile = 0;
    WORD *pOrdinals;
    IMAGE_DOS_HEADER *pMZ;
    IMAGE_NT_HEADERS *pPE;
    IMAGE_EXPORT_DIRECTORY *pExpDir;
    CHAR *currentName;
    KernelImagebase = (DWORD)LoadLibrary(L"Kernel32.dll");
    pMZ = (IMAGE_DOS_HEADER*)KernelImagebase;
    pPE = (IMAGE_NT_HEADERS*)(KernelImagebase + pMZ->e_lfanew);
    pExpDir = (IMAGE_EXPORT_DIRECTORY*)(KernelImagebase + pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);
    pNames = (DWORD*)(KernelImagebase + pExpDir->AddressOfNames);
    pOrdinals = (WORD*)(KernelImagebase + pExpDir->AddressOfNameOrdinals);
    pAddresses = (DWORD*)(KernelImagebase + pExpDir->AddressOfFunctions); // reading the EAF-protected field
    for(DWORD i = 0; i < pExpDir->NumberOfNames; i++)
    {
        currentName = (CHAR*)(KernelImagebase + pNames[i]);
        if(lstrcmpA(currentName, "CreateFileA") == 0)
        {
            // the name index must be mapped through AddressOfNameOrdinals
            // to get the index into the AddressOfFunctions array
            pCreateFile = (DWORD)(KernelImagebase + pAddresses[pOrdinals[i]]);
        }
    }
    return pCreateFile;
}
void main(void)
{
    DWORD apiAddress;
    Sleep(2000); // this delay will fix the race condition!
    // print the debug registers
    CONTEXT myContext;
    memset(&myContext, 0, sizeof(myContext));
    myContext.ContextFlags = CONTEXT_ALL;
    HANDLE hThread = GetCurrentThread();
    if(!GetThreadContext(hThread, &myContext)){
        printf("cannot get thread context \n");
    }
    printf("main D0: %08x, D1: %08x, D2: %08x, D3: %08x\n",
        myContext.Dr0, myContext.Dr1, myContext.Dr2, myContext.Dr3);
    // test1: checking the export table of kernel32.dll from this executable module
    apiAddress = getApiAddress();
    printf("Test1 CreateFileA function: %08x \n", apiAddress);
    // test2: checking the export table of kernel32.dll from the heap
    DWORD functionSize, pMain, pgetApiAddress;
    pMain = DWORD(&main);
    pgetApiAddress = DWORD(&getApiAddress);
    functionSize = pMain - pgetApiAddress;
    BYTE *shellcode = (BYTE*)malloc(functionSize);
    memcpy(shellcode, (BYTE*)pgetApiAddress, functionSize);
    __asm
    {
        mov ebx, shellcode
        call ebx
        mov apiAddress, eax
    }
    free(shellcode);
    printf("Test2 CreateFileA function: %08x \n", apiAddress);
    getchar();
}
Note: when you compile this code (I used Visual Studio), you must disable all the optimizations to avoid changes to the code layout, and also remove DEP from the linker options.
"test1" retrives the address of the CreateFileA API from inside the executable module; "test2" does the same from the heap.
If you don't add the Sleep(2000) in the main() function, you will get this output:
Notice how the debug registers are all set to zero and both tests ran successfully.
Otherwise if you keep the Sleep(2000) in the code, you will get:
As you can see, the debug registers are set and the EAF protection is active, therefore the application crashes when running the second test:
I think that a better usage of the synchronization objects may avoid this race condition: for instance, implementing these routines using a critical section and two events would have probably been a safer alternative.
In this implementation, the main thread and every additional thread that is created, will add itself to the thread array (processed by the protector thread). The code to do this will be inside a critical section object: in this way, we ensure that if multiple threads are created, only one at the time will run the code to add itself to the threads array. Also, the critical section is a cheap synchronization object compared to the mutex used in the EMET implementation.
The protector thread is constantly waiting on "event 1", which is an event object: it is thus not wasting CPU cycles looping continuously, like the current EMET implementation does, it will only spawn and use the CPU when a new thread is created. In fact, a new thread will add itself to the threads array, and then will signal "event 1", waking up the protector thread. The new thread will then stop and wait for "event 2". Meanwhile, the protector thread has the time to process the threads array, and because of the structure of the code, it is sure that no other thread will be modifying it. Once EAF is activated, the protector thread signals "event 2" and then goes back to wait for "event 1". The signaled "event 2" will wake up the new thread, which will then continue its normal execution.
This implementation has several advantages respect the one from EMET:
Now let's go back to the second thread: how exactly is EAF implemented? Let's recall that the hardware breakpoints are set by using the CPU debug registers.
"test1" retrives the address of the CreateFileA API from inside the executable module; "test2" does the same from the heap.
If you don't add the Sleep(2000) in the main() function, you will get this output:
main D0: 00000000, D1: 00000000, D2: 00000000, D3: 00000000
Test1 CreateFileA function: 7649bde6
Test2 CreateFileA function: 7649bde6
Notice how the debug registers are all set to zero and both tests ran successfully.
Otherwise if you keep the Sleep(2000) in the code, you will get:
main D0: 7651fa5c, D1: 77e40204, D2: 00000000, D3: 00000000
Test1 CreateFileA function: 7649bde6

As you can see, the debug registers are set and the EAF protection is active, therefore the application crashes when running the second test.
I think that a better use of the synchronization objects could avoid this race condition: for instance, implementing these routines with a critical section and two events would probably have been a safer alternative.
In this implementation, the main thread and every additional thread that is created will add itself to the thread array (processed by the protector thread). The code doing this would be guarded by a critical section object: in this way, we ensure that if multiple threads are created, only one at a time runs the code that adds itself to the thread array. Also, a critical section is a cheap synchronization object compared to the mutex used in the EMET implementation.
The protector thread constantly waits on "event 1", an event object: it thus doesn't waste CPU cycles looping continuously, like the current EMET implementation does, but only wakes up and uses the CPU when a new thread is created. In fact, a new thread will add itself to the thread array and then signal "event 1", waking up the protector thread. The new thread then stops and waits for "event 2". Meanwhile, the protector thread has the time to process the thread array and, because of the structure of the code, it is sure that no other thread is modifying it. Once EAF is activated, the protector thread signals "event 2" and goes back to waiting for "event 1". The signaled "event 2" wakes up the new thread, which then continues its normal execution.
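A rough sketch of this alternative scheme could look like the following (purely illustrative, with names of my own; SetEafBreakpoints again stands for the routine that actually sets the debug registers):

#include <Windows.h>

void SetEafBreakpoints(DWORD threadId);   // hypothetical helper setting DR0/DR1/DR7

CRITICAL_SECTION g_cs;
HANDLE g_event1;                          // "a thread is waiting to be protected"
HANDLE g_event2;                          // "the breakpoints have been set"
DWORD  g_pendingThreadId;

void InitEafSync(void)
{
    InitializeCriticalSection(&g_cs);
    g_event1 = CreateEvent(NULL, FALSE, FALSE, NULL);   // auto-reset, initially non-signaled
    g_event2 = CreateEvent(NULL, FALSE, FALSE, NULL);
}

// called at the very beginning of the main thread and of every new thread
void RegisterThreadForEaf(void)
{
    EnterCriticalSection(&g_cs);              // only one thread at a time past this point
    g_pendingThreadId = GetCurrentThreadId();
    SetEvent(g_event1);                       // wake up the protector thread
    WaitForSingleObject(g_event2, INFINITE);  // block until EAF is active for this thread
    LeaveCriticalSection(&g_cs);
}

// protector thread: consumes no CPU until there is actually something to do
DWORD WINAPI ProtectorThread(LPVOID unused)
{
    (void)unused;
    for(;;)
    {
        WaitForSingleObject(g_event1, INFINITE);
        SetEafBreakpoints(g_pendingThreadId);
        SetEvent(g_event2);                   // let the registered thread continue
    }
}

Note that the registered thread holds the critical section while it waits on "event 2", so the single pending-ID variable can never be overwritten by another thread.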
This implementation has several advantages with respect to the one from EMET:
- The protector thread only uses resources when it has to.
- Only one thread at a time registers itself, which removes the need for an array in the first place: the code could just use a single variable, avoiding both the arbitrary size of 256 and the rare but possible case of the array filling up before the protector thread runs.
- The new thread is guaranteed to be protected when it reaches the user code, avoiding the small window of vulnerability described in EMET's implementation.
Now let's go back to the second thread: how exactly is EAF implemented? Let's recall that the hardware breakpoints are set by using the CPU debug registers.
EMET walks every entry in the thread list, then successively opens and suspends each thread in order to modify its context using the SetThreadContext API.
As you can see from the image above, the AddressOfFunctions fields of the Export Tables of kernel32.dll and ntdll.dll are used to fill the DR0 and DR1 registers, while some appropriate flags are set in DR7.
These flags are:
- L0, L1 used to activate the local breakpoints (meaning that they only work in the current thread);
- LE used for backward compatibility reasons;
- R/W0, R/W1 used to indicate if the breakpoint is set on read, write, or execute operations;
- LEN0, LEN1 used to specify the size of the data on which the breakpoint acts.
In short: L0, L1, LE are set to 1 (which means that these flags are enabled); R/W0 and R/W1 are set to 11 binary (which means that the breakpoint triggers on data reads or writes); LEN0 and LEN1 are set to 11 binary (meaning 4-byte long breakpoints).
When these modifications are done, the thread is resumed and the EAF protection becomes active.
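In C, the operation performed on each thread roughly corresponds to the following sketch (names are mine; the two global pointers stand for the saved locations of the AddressOfFunctions fields):

#include <Windows.h>

DWORD g_pAddrOfFunctions_kernel32;   // address of kernel32's AddressOfFunctions field
DWORD g_pAddrOfFunctions_ntdll;      // address of ntdll's AddressOfFunctions field

void SetEafBreakpoints(DWORD threadId)
{
    HANDLE hThread = OpenThread(THREAD_ALL_ACCESS, FALSE, threadId);
    if(hThread == NULL)
        return;

    if(SuspendThread(hThread) != (DWORD)-1)
    {
        CONTEXT ctx;
        ZeroMemory(&ctx, sizeof(ctx));
        ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
        if(GetThreadContext(hThread, &ctx))
        {
            ctx.Dr0 = g_pAddrOfFunctions_kernel32;   // hardware breakpoint 0
            ctx.Dr1 = g_pAddrOfFunctions_ntdll;      // hardware breakpoint 1
            // DR7: L0 (bit 0), L1 (bit 2), LE (bit 8) = 1;
            // R/W0 = R/W1 = 11b (break on data read/write), LEN0 = LEN1 = 11b (4 bytes)
            ctx.Dr7 = 0x1 | 0x4 | 0x100 |
                      (3u << 16) | (3u << 18) |      // R/W0, LEN0
                      (3u << 20) | (3u << 22);       // R/W1, LEN1
            SetThreadContext(hThread, &ctx);
        }
        ResumeThread(hThread);
    }
    CloseHandle(hThread);
}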
If you are interested in digging into the debug registers and how Windows handles them, I suggest you read this article by Alex Ionescu.
At this point our description is almost complete; the only missing piece is the function installed as an Exception Handler. Let's briefly recall that a function registered as a Vectored Exception Handler must have the following prototype:
LONG CALLBACK VectoredHandler( _In_ PEXCEPTION_POINTERS ExceptionInfo );
In particular, EMET accesses ExceptionInfo->ExceptionRecord->ExceptionCode to filter the exception itself, making sure that it's a Single Step one (remember that when a hardware breakpoint is hit, the generated exception is of type Single Step). If it is, EMET disables all the active hardware breakpoints (that is, it sets the L0 and L1 flags in DR7 to zero).
Then, it reads the context at the time the exception happened through ExceptionInfo->ContextRecord, and checks the four lowest bits of DR6 (B0 to B3): these bits indicate that a hardware breakpoint condition was met when the Single Step exception was raised (distinguishing it from the exceptions generated when the Trap Flag is set).
However, I'm quite sure that there's a little bug in this check:
.text:000546C4 test byte ptr [eax+CONTEXT.Dr6], 11h ; bug! 11h should be 3
.text:000546C8 jz short not_handled
.text:000546CA push [eax+CONTEXT._Eip] ; reg_eip
.text:000546D0 call is_in_module
.text:000546D5 test eax, eax
.text:000546D7 jnz short not_handled
.text:000546D9 push edi
.text:000546DA push 1
.text:000546DC call report_protection
.text:000546E1 cmp status_exploitaction, 1
.text:000546E8 pop ecx
.text:000546E9 pop ecx
.text:000546EA jnz short not_handled
.text:000546EC push 1
.text:000546EE push STATUS_STACK_BUFFER_OVERRUN
.text:000546F3 push dword ptr [edi+4]
.text:000546F6 call report_error_and_terminate
.text:000546FB not_handled:
...
In fact, EMET tests DR6 against 11 hex, which is 10001 in binary, corresponding to B0 plus the undocumented 5th bit that, according to the Intel manuals, is always set to 1. I believe this is a typo, and that the correct mask to test was 11 in binary (that is, 3 hex), covering both B0 and B1.
This is not a serious issue, because hits on the DR1 breakpoint end up being handled anyway (the test always passes thanks to the always-set bit), but it is rather pointless to let EMET handle Single Step exceptions for breakpoints that were never actually set.
If one of the two hardware breakpoints was hit when the exception occurred (which may always appear to be the case because of the buggy TEST instruction), EMET checks the value of the EIP register at that time (through ExceptionInfo->ContextRecord->Eip) to verify, using GetModuleHandleEx, whether the instruction that caused the Single Step exception belongs to an executable module. If it doesn't, the event is logged and, if "status_exploitaction" is set (this variable corresponds to the "Stop on exploit/Audit only" option available from EMET's settings panel), a STATUS_STACK_BUFFER_OVERRUN is reported (through ExceptionInfo->ExceptionRecord->ExceptionCode), the exception is left unhandled and the process is terminated. In all other cases (that is, if neither of the two bits in DR6 is set, if the instruction pointed to by EIP does belong to an executable module, or if "status_exploitaction" isn't set), EMET clears all the bits in DR6 and sets the L0 and L1 flags in DR7 again, letting the execution resume as if nothing had happened.
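Putting it all together, the handler's logic is roughly equivalent to the following sketch (based on the disassembly above, with the DR6 mask written as the presumably intended value 3 and the termination mechanism simplified; IsInExecutableModule, ReportEafViolation and g_stopOnExploit are placeholders for EMET's internal helpers and settings):

#include <Windows.h>

#define STATUS_STACK_BUFFER_OVERRUN ((DWORD)0xC0000409L)   // value from ntstatus.h

BOOL IsInExecutableModule(PVOID address);   // placeholder: the GetModuleHandleEx-based check
void ReportEafViolation(CONTEXT *ctx);      // placeholder: EMET's logging routine
BOOL g_stopOnExploit;                       // "Stop on exploit" vs "Audit only"

LONG CALLBACK EafHandler(PEXCEPTION_POINTERS ExceptionInfo)
{
    EXCEPTION_RECORD *rec = ExceptionInfo->ExceptionRecord;
    CONTEXT *ctx = ExceptionInfo->ContextRecord;

    if(rec->ExceptionCode != EXCEPTION_SINGLE_STEP)
        return EXCEPTION_CONTINUE_SEARCH;

    // temporarily disable the hardware breakpoints (clear L0 and L1 in DR7)
    ctx->Dr7 &= ~(DWORD)(0x1 | 0x4);

    // B0 or B1 set in DR6 -> one of the two EAF breakpoints was hit
    // (EMET actually tests 0x11 here, as discussed above)
    if((ctx->Dr6 & 3) != 0 && !IsInExecutableModule((PVOID)ctx->Eip))
    {
        ReportEafViolation(ctx);
        if(g_stopOnExploit)
            TerminateProcess(GetCurrentProcess(), STATUS_STACK_BUFFER_OVERRUN);
    }

    // resume as if nothing happened: clear DR6 and re-enable L0/L1
    ctx->Dr6 = 0;
    ctx->Dr7 |= (0x1 | 0x4);
    return EXCEPTION_CONTINUE_EXECUTION;
}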
Our journey through the EAF implementation is now over, but I would like to discuss briefly a couple of methods to bypass it. As declared by Microsoft, EAF wasn't meant as a definitive protection against unwanted access to the APIs addresses, but more as an obstacle for existing shellcodes.
One simple way to obtain such information, without any need to access the Export Table, is to use the Import Table instead. In particular, you can parse the Import Table of a DLL that is *importing* the desired API from kernel32.dll or ntdll.dll, and look at the OriginalFirstThunk and FirstThunk fields of the IMAGE_IMPORT_DESCRIPTOR structure. For example, User32.dll is loaded in almost every running process, and it imports both the LoadLibrary and GetProcAddress APIs, which are commonly used in shellcodes to get the addresses of other APIs.
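As a sketch of this technique (assuming user32.dll is loaded in the target process, 32-bit as in the rest of the article; the function name is mine and error handling is omitted):

#include <Windows.h>

// Resolve an API imported by user32.dll (e.g. from kernel32.dll) by walking its Import Table,
// instead of touching the EAF-protected Export Table.
// Usage example: FindImportedApi("kernel32.dll", "GetProcAddress")
FARPROC FindImportedApi(const char *dllName, const char *apiName)
{
    DWORD base = (DWORD)GetModuleHandleA("user32.dll");
    IMAGE_DOS_HEADER *mz = (IMAGE_DOS_HEADER*)base;
    IMAGE_NT_HEADERS *pe = (IMAGE_NT_HEADERS*)(base + mz->e_lfanew);
    IMAGE_IMPORT_DESCRIPTOR *imp = (IMAGE_IMPORT_DESCRIPTOR*)(base +
        pe->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress);

    for(; imp->Name != 0; imp++)
    {
        if(lstrcmpiA((char*)(base + imp->Name), dllName) != 0)
            continue;
        // OriginalFirstThunk -> imported names, FirstThunk -> resolved addresses (the IAT)
        IMAGE_THUNK_DATA *names = (IMAGE_THUNK_DATA*)(base + imp->OriginalFirstThunk);
        IMAGE_THUNK_DATA *addrs = (IMAGE_THUNK_DATA*)(base + imp->FirstThunk);
        for(; names->u1.AddressOfData != 0; names++, addrs++)
        {
            if(names->u1.Ordinal & IMAGE_ORDINAL_FLAG)   // imported by ordinal, no name
                continue;
            IMAGE_IMPORT_BY_NAME *ibn = (IMAGE_IMPORT_BY_NAME*)(base + names->u1.AddressOfData);
            if(lstrcmpA((char*)ibn->Name, apiName) == 0)
                return (FARPROC)addrs->u1.Function;      // already resolved by the loader
        }
    }
    return NULL;
}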
Another method to bypass EAF is to use a specially crafted ROP gadget just to retrieve the AddressOfFunctions value. In this way, since you are reading the Export Table from a gadget that lies within an executable module, EMET won't detect anything suspicious and you can then find the addresses of all the needed APIs. Of course, EMET performs some security checks against ROP too, but since we need only one gadget it's not too difficult to find one that eludes the protection itself (or else, you may want to use a JOP gadget). For example, a shellcode may parse the Export Table of a module in order to find the pointer to the AddressOfFunctions field, put this pointer in the EAX register and then call a code gadget that does the following:
MOV EAX, [EAX]
RET
This gadget is very short, it only requires three bytes of opcodes (8B 00 C3), so it should be very easy to find it inside most executable modules.
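Just to give an idea, a naive scan for this byte sequence in the executable sections of an already-loaded module could look like this (an illustrative sketch, not exploit code):

#include <Windows.h>

// Look for the byte sequence 8B 00 C3 (MOV EAX,[EAX] / RET) inside a loaded module.
DWORD FindMovEaxDerefGadget(const char *moduleName)
{
    DWORD base = (DWORD)GetModuleHandleA(moduleName);
    IMAGE_DOS_HEADER *mz = (IMAGE_DOS_HEADER*)base;
    IMAGE_NT_HEADERS *pe = (IMAGE_NT_HEADERS*)(base + mz->e_lfanew);
    IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(pe);

    for(WORD s = 0; s < pe->FileHeader.NumberOfSections; s++, sec++)
    {
        if(!(sec->Characteristics & IMAGE_SCN_MEM_EXECUTE))   // only scan executable code
            continue;
        BYTE *start = (BYTE*)(base + sec->VirtualAddress);
        DWORD size = sec->Misc.VirtualSize;
        for(DWORD i = 0; i + 3 <= size; i++)
        {
            if(start[i] == 0x8B && start[i + 1] == 0x00 && start[i + 2] == 0xC3)
                return (DWORD)(start + i);                    // address of the gadget
        }
    }
    return 0;
}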
These are just two simple ideas that came to mind; of course, they are nothing new, and surely you can find other ways to achieve the same result. Moreover, both methods assume that you have already gotten rid of DEP and ASLR, which are the real pain when writing an exploit.
Note: the analysis was originally written in September 2013 for version 4.0, but it still holds for current version 4.1.