Tuesday, November 11, 2014

Solution to some of "The Windows kernel" exercises from Practical Reverse Engineering

Recently I spent some time improving my knowledge of the Windows kernel and I gave a go at some of the exercises from the "Practical Reverse Engineering" book. I wanted to share the solutions to the ones proposed on page 180 ("Building Knowledge and Solidifying Your Knowledge" from "The Windows Kernel" chapter), as I think this can be an opportunity to discuss them with other people :) Hopefully I will be able to share more solutions in future blog entries! :)

  • First explanation: if a driver thread runs at DISPATCH_LEVEL, it cannot be interrupted (or pre-empted) by other code from lower IRQLs (that are PASSIVE_LEVEL and APC_LEVEL). If such a thread accesses a page that has been swapped to disk, the page fault handler will generate an I/O operation to read the page from the disk and restore it in memory... and here is the catch: the completion routine for I/O runs in APC_LEVEL, which means that the I/O for the swapped page won't be able to complete, causing a BSOD.
  • Second explanation: page faults at DISPATCH_LEVEL may not be caught by exception handlers. Even if a driver encloses a "dangerous" piece of code inside a try/except statement, if such code generates for example a page fault in non-paged memory, the fault will not be dispatched to the exception handler and the system will throw a blue screen. Thus, even writing "careful" code that protects pointers' dereference with exception handlers will not mitigate blue screens. Here is a quote from a minidump generated by a PAGE_FAULT_IN_NONPAGED_AREA:
    Invalid system memory was referenced.  This cannot be protected by try-except, it must be protected by a Probe.
    Note that there are page faults that can be safely caught by exception handlers, e.g. access violations to usermode pointers.


Thread priority is used by the scheduler to decide which thread must be run next and for how long. The operation of interrupting a thread to schedule the next one is called thread dispatching. However, the thread dispatcher only operates at PASSIVE_LEVEL or APC_LEVEL. Thus, if a thread is at DISPATCH_LEVEL or above, the thread dispatcher won't be able to pre-empt it and the thread "priority" will depend on the IRQL itself. Therefore, in this sense, code in kernel mode that runs at DISPATCH_LEVEL can run "faster", but having a heavy workload at high IRQL will cause a general degradation of performance for the whole system. Code that runs at high IRQL should be designed to run and finish its task as quickly as possible in order not to degrade performance in general.


The following code monitors the creation and termination of threads and processes, and the loading and unloading of executable modules.

 #include <ntddk.h>  
 #include <string.h>  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 VOID ImageHandler(  
   __in_opt PUNICODE_STRING FullImageName,  
   __in HANDLE ProcessId,  
   __in PIMAGE_INFO ImageInfo  
      DbgPrint("ImageBase: %016x\n", ImageInfo->ImageBase);  
 VOID ProcessHandler(  
   IN HANDLE ParentId,  
   IN HANDLE ProcessId,  
   IN BOOLEAN Create  
      DbgPrint("Pid: %08x\n", ProcessId);  
 VOID ThreadHandler(  
   IN HANDLE ProcessId,  
   IN HANDLE ThreadId,  
   IN BOOLEAN Create  
      DbgPrint("Tid: %08x\n", ThreadId);  
 VOID MyUnload(  
      __in PDRIVER_OBJECT DriverObject  
      DbgPrint("Unload routine\n");  
      if(PsRemoveLoadImageNotifyRoutine(&ImageHandler) != STATUS_SUCCESS ||  
           PsSetCreateProcessNotifyRoutine(&ProcessHandler, TRUE) != STATUS_SUCCESS ||  
           PsRemoveCreateThreadNotifyRoutine(&ThreadHandler) != STATUS_SUCCESS)  
           DbgPrint("Error in unregistering notify routines\n");  
 NTSTATUS DriverEntry(__in PDRIVER_OBJECT DriverObject, __in PUNICODE_STRING RegistryPath)  
      NTSTATUS StatusTemp;  
      DbgPrint("Driver started\n");  
      DriverObject->DriverUnload = &MyUnload;  
      if( PsSetLoadImageNotifyRoutine(&ImageHandler) != STATUS_SUCCESS )   
           DbgPrint("Error registering ImageHandler\n");  
      StatusTemp = PsSetCreateProcessNotifyRoutine(&ProcessHandler, FALSE);  
      if( StatusTemp != STATUS_SUCCESS )  
           DbgPrint("Error registering ProcessHandler\n");  
      StatusTemp = PsSetCreateThreadNotifyRoutine(&ThreadHandler);  
      if(StatusTemp != STATUS_SUCCESS)  
           DbgPrint("Error registering ThreadHandler\n");  
      return STATUS_SUCCESS;  

An example of the output it produces:

ImageBase: 00000000fcae0000
ImageBase: 00000000fcae0000
Tid: 00000e64
Tid: 000003dc
Tid: 000006c4
Tid: 000005b8
Pid: 00000f48
Tid: 00000da8
Tid: 00000ac8
Tid: 00000c6c
Tid: 00000870
Tid: 00000d20
Tid: 000006b4
Tid: 00000448
Tid: 00000ed4
Tid: 00000d8c
Pid: 00000d50
Tid: 000005b0
ImageBase: 00000000ff330000
ImageBase: 0000000077bd0000
ImageBase: 0000000077ab0000
ImageBase: 00000000fdbc0000
ImageBase: 00000000ffe00000
ImageBase: 00000000ff4d0000
ImageBase: 00000000fe180000
ImageBase: 00000000fe1b0000
Unload routine


If METHOD_NEITHER is employed, the I/O manager does not provide any validation on the buffer that is received from the usermode application, nor it generates an MDL for it (as opposed to METHOD_BUFFERED and METHOD_[IN/OUT]_DIRECT). This means that the address of the usermode buffer is passed directly to the target driver, that is in charge  of performing all the necessary validations. A driver should probe the pages to make them resident and to verify that it has the requested access (e.g. read/write) to the user buffer. Then, it should lock the pages (so that they won't be paged out to the swap file) and should access the data, making sure that the code that does these operations is enclosed in a try/except block to handle potential exceptions (e.g. if a usermode thread changes the memory protection of the buffer while the driver is accessing it). The use  of METHOD_NEITHER also implies that the driver code that carries out the validation of the buffer needs to run in the same context of the thread that generated the IOCTL request (for example, if the IRP is queued for later processing, it may be processed in an arbitrary thread/process context, thus the address of the user buffer may not refer to the correct virtual memory any more).


The address I choose to translate: f897a000 (it's the base address of a driver). 

lkd> db f897a000
f897a000  4d 5a 90 00 03 00 00 00-04 00 00 00 ff ff 00 00  MZ..............
f897a010  b8 00 00 00 00 00 00 00-40 00 00 00 00 00 00 00  ........@.......
f897a020  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
f897a030  00 00 00 00 00 00 00 00-00 00 00 00 e0 00 00 00  ................
f897a040  0e 1f ba 0e 00 b4 09 cd-21 b8 01 4c cd 21 54 68  ........!..L.!Th
f897a050  69 73 20 70 72 6f 67 72-61 6d 20 63 61 6e 6e 6f  is program canno
f897a060  74 20 62 65 20 72 75 6e-20 69 6e 20 44 4f 53 20  t be run in DOS 
f897a070  6d 6f 64 65 2e 0d 0d 0a-24 00 00 00 00 00 00 00  mode....$.......

This address corresponds to a 4K page in a x86 system with PAE enabled. I break down the virtual address in its components according to the Intel's documentation:

f897a000   hex
11111000100101111010000000000000   bin

11    111000100   101111010   000000000000  (broken out)
PDPE     PDE         PTE         Offset
2bit    9 bits      9 bits      12 bits
 3    1c4h 452d   17ah 378d        0

Next, I need a page directory in order to perform the translation. Since the address I chose is a kernelmode one, I can get the page directory from any process: in this case I used the one from "alg.exe". I get it by listing all the processes with:

!process 0 0

PROCESS 820a8240  SessionId: 0  Cid: 04e4    Peb: 7ffde000  ParentCid: 029c
    DirBase: 081002e0  ObjectTable: e1f36ae0  HandleCount: 105.
    Image: alg.exe

I can double check it in the related EPROCESS structure, the page directory is stored in the DirectoryTableBase member:

lkd> dt -b nt!_EPROCESS 820a8240  
   +0x000 Pcb              : _KPROCESS
      +0x000 Header           : _DISPATCHER_HEADER
         +0x000 Type             : 0x3 ''
         +0x001 Absolute         : 0 ''
         +0x002 Size             : 0x1b ''
         +0x003 Inserted         : 0 ''
         +0x004 SignalState      : 0n0
         +0x008 WaitListHead     : _LIST_ENTRY [ 0x81a97398 - 0x81a97398 ]
            +0x000 Flink            : 0x81a97398 
            +0x004 Blink            : 0x81a97398 
      +0x010 ProfileListHead  : _LIST_ENTRY [ 0x820a8250 - 0x820a8250 ]
         +0x000 Flink            : 0x820a8250 
         +0x004 Blink            : 0x820a8250 
      +0x018 DirectoryTableBase
       [00] 0x81002e0
       [01] 0xf454
      +0x020 LdtDescriptor    : _KGDTENTRY

The first step consists in retrieving the address of the page directory from the PDPE array.
PDPE entry number 3: 81002e0 + 3*8 = 81002F8

lkd> !dc 081002e0  
# 81002e0   0f691001 00000000 0f452001 00000000 ..i...... E.....
# 81002f0   0f693001 00000000 0f610001 00000000 .0i.......a.....
# 8100300   133db001 00000000 1ea1c001 00000000 ..=.............

The value I am looking for is 0f610001. The lower 12 bits are used to store flags and other information, thus they can be ignored and set to zero to obtain the page address, which is 0f610000.

In the second step, I locate the address of the page table from the page directory.
PDE entry number 1C4: 0f610000 + 1c4*8 = F610E20.

lkd> !dc F610E20
# f610e20 01032163 00000000 01033163 00000000 c!......c1......
# f610e30 01034163 00000000 01035163 00000000 cA......cQ......

I obtain the value 01032163, from which I zero out the lower 12 bits to obtain the page address 01032000.

The third step gives me the address of the the memory page from the page table.
PTE entry number 17a: 01032000 + 17a*8 = 1032BD0.

lkd> !dc 1032BD0
# 1032bd0 0437b163 00000000 0437c121 00000000 c.7.....!.7.....
# 1032be0 0437d121 00000000 0437e163 00000000 !.7.....c.7.....

The value is 0437b163, that is 0437b000 when the lower 12 bits are zeroed. This is the physical address of the memory page.

In the final step I add the "offset" part of the virtual address to the address I just obtained in step 3 in order to have the final physical address of the data I'm looking for. In this case, the offset is zero, thus the address remains 0437b000. I can verify that it is correct by visualizing the data pointed by the physical address:

lkd> !dc 0437b000
# 437b000 00905a4d 00000003 00000004 0000ffff MZ..............
# 437b010 000000b8 00000000 00000040 00000000 ........@.......
# 437b020 00000000 00000000 00000000 00000000 ................
# 437b030 00000000 00000000 00000000 000000e0 ................
# 437b040 0eba1f0e cd09b400 4c01b821 685421cd ........!..L.!Th
# 437b050 70207369 72676f72 63206d61 6f6e6e61 is program canno
# 437b060 65622074 6e757220 206e6920 20534f44 t be run in DOS 
# 437b070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......

This is the same data that I saw earlier visualizing the virtual address f897a000. A further verification can be done with vtop:

lkd> !vtop 0 f897a000
X86VtoP: Virt f897a000, pagedir 8100320
X86VtoP: PAE PDPE 8100338 - 000000001cd53001
X86VtoP: PAE PDE 1cd53e20 - 0000000001032163
X86VtoP: PAE PTE 1032bd0 - 000000000437b163
X86VtoP: PAE Mapped phys 437b000
Virtual address f897a000 translates to physical address 437b000.

My system also supports big 2Mb pages, thus I decided to perform the same virtual-to-physical translation on a 2Mb page. The operating system uses big pages for the kernel, the hal and the non-paged pool, hence I chose the address of the NtCreateFile api that is located inside the kernel (therefore inside a 2Mb page).
The address to translate is: 8056d14c (NtCreateFile)
Similarly to the previous translation, I start by identifying the components of the virtual address:

8056d14c   hex
10000000010101101101000101001100  bin

 10      000000010  101101101000101001100   (broken out)
PDPE       PDE            Offset
2 bits    9 bits          21 bits
  2         2             16D14Ch

The only difference is that this time the translation is one passage shorter (the PTE disappears). Again, I use the page directory from the "alg.exe" process.
I start by extracting the page directory.
PDPE entry number 2: 081002e0 + 2*8 = 81002f0

lkd> !dc 81002f0
# 81002f0 0f693001 00000000 0f610001 00000000 .0i.......a.....
# 8100300 133db001 00000000 1ea1c001 00000000 ..=.............

I obtain the address 0f693000, which I use to retrieve the physical address of the memory page:
PDE entry number 2: 0f693000 + 2*8 = 0f693010

lkd> !dc 0f693010
# f693010 004001e3 00000000 006001e3 00000000 ..@.......`.....
# f693020 00b08163 00000000 00b09163 00000000 c.......c.......

And I end up with 00400000, to which I need to add the offset:

Offset 16D14Ch: 00400000 + 16D14C = 56d14c

I can assess the correctness of the physical address by verifying that the data I see reading the virtual address is the same I see reading the physical address:

lkd> !dc 56d14c
#  56d14c 8b55ff8b 50c033ec 75ff5050 2c75ff30 ..U..3.PPP.u0.u,
#  56d15c ff2875ff 75ff2475 1c75ff20 ff1875ff .u(.u$.u .u..u..

lkd> db 8056d14c 
8056d14c  8b ff 55 8b ec 33 c0 50-50 50 ff 75 30 ff 75 2c  ..U..3.PPP.u0.u,
8056d15c  ff 75 28 ff 75 24 ff 75-20 ff 75 1c ff 75 18 ff  .u(.u$.u .u..u..


The following code performs various operations on a list.

 #include <ntddk.h>  
 #include <string.h>  
 typedef struct _mydata  
      int a;  
      char *s;  
      LIST_ENTRY mylist;  
 } mydata;  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 VOID MyUnload(  
      __in PDRIVER_OBJECT DriverObject  
      DbgPrint("Unload routine\n");  
   __in PDRIVER_OBJECT  DriverObject,  
   __in PUNICODE_STRING RegistryPath  
      mydata data;  
      LIST_ENTRY myhead;  
      mydata data1, data2, data3;  
      LIST_ENTRY *plist;  
      int i = 0;  
      DriverObject->DriverUnload = &MyUnload;  
      data1.a = 1;  
      data1.s = "string one";  
      data2.a = 2;  
      data2.s = "string two";  
      data3.a = 3;  
      data3.s = "string three";  
      InsertHeadList(&myhead, &(data1.mylist));  
      InsertTailList(&myhead, &(data2.mylist));  
      InsertTailList(&myhead, &(data3.mylist));  
           DbgPrint("List is empty after initialization\n");  
           DbgPrint("List is not empty after initialization\n");  
      plist = myhead.Flink;  
      while(plist != &myhead)  
           if(i++ >20)  
           DbgPrint("entry: %d ", (CONTAINING_RECORD(plist, mydata, mylist))->a );  
           DbgPrint("string: %s \n", (CONTAINING_RECORD(plist, mydata, mylist))->s );  
           plist = plist->Flink;  
           DbgPrint("List is empty at the end of DriverEntry\n");  
           DbgPrint("List is not empty at the end of DriverEntry\n");  
      return STATUS_SUCCESS;  

This is an example of the output it produces:

List is not empty after initialization
entry: 1 string: string one 
entry: 2 string: string two 
entry: 3 string: string three 
List is empty at the end of DriverEntry
Unload routine

I identified the following inline functions in the decompiled code:

InsertHeadList(&myhead, &(data1.mylist))
InsertTailList(&myhead, &(data2.mylist))
InsertTailList(&myhead, &(data3.mylist))

(because of compiler optimizations, these routines have been "merged" to eliminate unnecessary instructions, thus the boundaries of each routine are not very well defined any more)

lea     rax, [r11-78h]
mov     [r11-30h], rax  ; copy &myhead.mylist in data1->Blink
lea     rax, [r11-38h]
mov     [r11-78h], rax  ; copy &data1.mylist in myhead->Flink    
lea     rax, [r11-38h]
mov     [r11-50h], rax  ; copy &data1.mylist in data2->Blink
lea     rax, [r11-58h]
mov     [r11-38h], rax  ; copy &data2.mylist in data1->Flink    
lea     rax, [r11-78h]
mov     [r11-18h], rax  ; copy &myhead.mylist in data3->Flink
lea     rax, [r11-58h]
mov     [r11-10h], rax  ; copy &data2.mylist in data3->Blink
lea     rax, [r11-18h]
mov     [r11-58h], rax  ; copy &data3.mylist in data2->Flink
lea     rax, [r11-18h]
mov     [r11-70h], rax  ; copy &data3.mylist in myhead->Blink


mov     rax, [rsp+48h]  
mov     rcx, [rsp+40h]  
mov     [rax], rcx
mov     [rcx+8], rax


mov     rax, [rsp+98h+var_78]
mov     rcx, [rax]
lea     rax, [rsp+98h+var_78]
mov     [rsp+98h+var_78], rcx
mov     [rcx+8], rax


mov     rax, [rsp+98h+var_70]
mov     rcx, [rax+8]
lea     rax, [rsp+98h+var_78]
mov     [rsp+98h+var_70], rcx
mov     [rcx], rax


lea     rax, [rsp+98h+var_78]
cmp     [rsp+98h+var_78], rax

It is possible to see a pattern in these routines: since the operations they perform is always the same, the logic of the computation of the assembly instructions will be the same too. However, when the routines are compiled, their instructions may use different operands (e.g. registers, stack variables or global offsets). The instructions' order may differ as well, they may be interleaved with instructions from other computations, or the compiler optimizations may merge some of them. In conclusion, although the "logical" pattern of a single routine is pretty much constant, its assembly implementation may have many variations that make the task of recognizing patterns quite complicated.


The following code illustrates an example of use of tables/trees and bitmaps.

 #include <ntddk.h>  
 #include <string.h>  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 typedef struct _node  
      int code;  
      char *p;  
 } node;  
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
      DbgPrint("Unload routine\n");  
 CompareRoutine (  
   __in struct _RTL_GENERIC_TABLE *Table,  
   __in PVOID FirstStruct,  
   __in PVOID SecondStruct  
  if(((node *)FirstStruct)->code < ((node *)SecondStruct)->code) return GenericLessThan;  
  if(((node *)FirstStruct)->code > ((node *)SecondStruct)->code) return GenericGreaterThan;  
  if(((node *)FirstStruct)->code == ((node *)SecondStruct)->code) return GenericEqual;  
  return 0;  
 AllocateRoutine (  
   __in struct _RTL_GENERIC_TABLE *Table,  
   __in CLONG ByteSize  
      return ExAllocatePoolWithTag(NonPagedPool, ByteSize, 'tag2');  
 FreeRoutine (  
   __in struct _RTL_GENERIC_TABLE *Table,  
   __in PVOID Buffer  
      ExFreePoolWithTag(Buffer, 'tag2');  
   __in PDRIVER_OBJECT  DriverObject,  
   __in PUNICODE_STRING RegistryPath  
      void *generic_table;  
      BOOLEAN success;  
      node *ptNode;  
      node entry_one;  
      node entry_two;  
      node entry_three;  
      RTL_BITMAP rtl_bitmap;  
      ULONG *bitmap_buffer;  
      ULONG bitmap_size = 48;  
      ULONG bit_index, num_bits;  
      DbgPrint("Driver started\n");  
      DriverObject->DriverUnload = &MyUnload;  
      entry_one.code = 1;  
      entry_one.p = "string one";  
      entry_two.code = 2;  
      entry_two.p = "string two";  
      entry_three.code = 3;  
      entry_three.p = "string three";  
      generic_table = ExAllocatePoolWithTag(NonPagedPool, sizeof(RTL_GENERIC_TABLE), 'tag1');  
      if(generic_table == NULL)  
           DbgPrint("Error in memory allocation\n");  
      RtlInitializeGenericTable(generic_table, &CompareRoutine, &AllocateRoutine, &FreeRoutine, NULL);  
      if(RtlInsertElementGenericTable(generic_table, &entry_one, sizeof(entry_one), &success) == NULL)  
           DbgPrint("Error in first insert\n");  
      if(RtlInsertElementGenericTable(generic_table, &entry_two, sizeof(entry_two), &success) == NULL)  
           DbgPrint("Error in second insert\n");  
      if(RtlInsertElementGenericTable(generic_table, &entry_three, sizeof(entry_three), &success) == NULL)  
           DbgPrint("Error in third insert\n");  
      DbgPrint("Number of elements in the table after initialization: %d\n", RtlNumberGenericTableElements(generic_table));  
      for (ptNode = RtlEnumerateGenericTable(generic_table, TRUE); ptNode != NULL; ptNode = RtlEnumerateGenericTable(generic_table, FALSE))   
                DbgPrint("Enumerating element: %d with string: %s \n", ptNode->code, ptNode->p);  
      if( ( ptNode = RtlLookupElementGenericTable(generic_table, &entry_two) ) != NULL)  
           DbgPrint("Lookup element: %d with string: %s \n", ptNode->code, ptNode->p);  
           DbgPrint("Error in lookup element \n");  
      if(RtlDeleteElementGenericTable(generic_table, &entry_three) == TRUE)  
           DbgPrint("Element three was correctly deleted!\n");  
           DbgPrint("Error in deleting element three\n");  
      DbgPrint("Number of elements in the table after removing element three: %d\n", RtlNumberGenericTableElements(generic_table));  
      if(RtlDeleteElementGenericTable(generic_table, &entry_two) == TRUE)  
           DbgPrint("Element two was correctly deleted!\n");  
           DbgPrint("Error in deleting element two\n");  
      if(RtlDeleteElementGenericTable(generic_table, &entry_one) == TRUE)  
           DbgPrint("Element one was correctly deleted!\n");  
           DbgPrint("Error in deleting element one\n");  
      if(RtlNumberGenericTableElements(generic_table) == 0)  
           DbgPrint("Table is empty and will be deallocated\n");  
           ExFreePoolWithTag(generic_table, 'tag1');  
           DbgPrint("Some error occurred when freeing the list\n");  
      // bitmap  
      bitmap_buffer = ExAllocatePoolWithTag(NonPagedPool, 2*sizeof(ULONG), 'tag3');  
      if(bitmap_buffer == NULL)  
           DbgPrint("Error in memory allocation\n");  
      RtlInitializeBitMap(&rtl_bitmap, bitmap_buffer, bitmap_size);  
      DbgPrint("Number of bits set after initialization: %d\n", RtlNumberOfSetBits(&rtl_bitmap));  
      RtlSetBits(&rtl_bitmap, 10, 5);  
      num_bits = RtlFindLongestRunClear(&rtl_bitmap, &bit_index);  
      DbgPrint("After setting bit ten to fifteen, the longest run of clear bits is at position %d size %d\n", bit_index, num_bits);  
      DbgPrint("The number of clear bits is: %d\n", RtlNumberOfClearBits(&rtl_bitmap));  
      RtlSetBit(&rtl_bitmap, 37);  
      DbgPrint("Are bits from 30 to 35 clear? %d \n", RtlAreBitsClear(&rtl_bitmap, 30, 5));  
      DbgPrint("Are bits from 30 to 40 clear? %d \n", RtlAreBitsClear(&rtl_bitmap, 30, 10));  
      ExFreePoolWithTag(bitmap_buffer, 'tag3');  
      return STATUS_SUCCESS;  

This is the output it produces:

Driver started
Number of elements in the table after initialization: 3
Enumerating element: 1  with string: string one 
Enumerating element: 2  with string: string two 
Enumerating element: 3  with string: string three 
Lookup element: 2  with string: string two 
Element three was correctly deleted!
Number of elements in the table after removing element three: 2
Element two was correctly deleted!
Element one was correctly deleted!
Table is empty and will be deallocated
Number of bits set after initialization: 0
After setting bit ten to fifteen, the longest run of clear bits is at position 15 size  33
The number of clear bits is: 43
Are bits from 30 to 35 clear? 1 
Are bits from 30 to 40 clear? 0 
Unload routine

This, instead, is an example of code that uses hash table APIs.

 #include <ntddk.h>  
 #include <string.h>  
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
      DbgPrint("Unload routine\n");  
   __in PDRIVER_OBJECT  DriverObject,  
   __in PUNICODE_STRING RegistryPath  
      RTL_DYNAMIC_HASH_TABLE *hash_table;  
      RTL_DYNAMIC_HASH_TABLE_ENTRY entry1, entry2, entry3, *result_hash;  
      ULONG sig = 1234;  
      BOOLEAN ret_val;  
      DbgPrint("Driver started\n");   
      DriverObject->DriverUnload = &MyUnload;   
      hash_table = ExAllocatePoolWithTag(NonPagedPool, sizeof(RTL_DYNAMIC_HASH_TABLE), 'tag4');  
      if(hash_table == NULL)  
           DbgPrint("Error in memory allocation\n");   
      ret_val = RtlCreateHashTable(&hash_table, 0, 0);  
      DbgPrint("Create hashtable returned %08x\n", ret_val);   
      DbgPrint("Entry1 address: %08x Entry2 address: %08x Entry3 address: %08x\n", &entry1, &entry2, &entry3);   
      entry1.Signature = sig;  
      ret_val = RtlInsertEntryHashTable(hash_table, &entry1, (ULONG_PTR)&sig, &context);  
      DbgPrint("Insert hashtable 1 returned %08x\n", ret_val);   
      entry2.Signature = sig;  
      ret_val = RtlInsertEntryHashTable(hash_table, &entry2, (ULONG_PTR)&sig, &context);  
      DbgPrint("Insert hashtable 2 returned %08x\n", ret_val);   
      entry3.Signature = sig + 3;  
      ret_val = RtlInsertEntryHashTable(hash_table, &entry3, (ULONG_PTR)&sig, &context);  
      DbgPrint("Insert hashtable 3 returned %08x\n", ret_val);   
      result_hash = RtlLookupEntryHashTable(hash_table, (ULONG_PTR)&sig, &context);  
      DbgPrint("Lookup hashtable returned entry at address %08x sig: %d \n", result_hash, *(ULONG*)(result_hash->Signature) );   
      ret_val = RtlRemoveEntryHashTable(hash_table, &entry1, &context);  
      DbgPrint("Remove hashtable returned %08x\n", ret_val);   
      ExFreePoolWithTag(hash_table, 'tag4');  
   return STATUS_SUCCESS;  

it produces the following output:

Driver started
Create hashtable returned 00000001
Entry1 address: 031667f0  Entry2 address: 03166808  Entry3 address: 03166820
Insert hashtable 1 returned 00000001
Insert hashtable 2 returned 00000001
Insert hashtable 3 returned 00000001
Lookup hashtable returned entry at address 03166820  sig: 1234 
Remove hashtable returned 00000001
Unload routine


FIELD_OFFSET is defined in ntdef.h as:
#define FIELD_OFFSET(type, field)    ((LONG)(LONG_PTR)&(((type *)0)->field))

The macro casts the number "0" to a pointer of type "type" (e.g. KPCR, or any other structure). Then, it considers the field "field" from this structure pointer, which translates in adding a delta (the offset of "field" in the structure) to the pointer. It uses the "&" operator to retrieve the address of the field itself, but since the pointer to the structure is 0, the whole operation simply retrieves the address 0 + Delta, that is: Delta. This value is finally cast to a pointer to long, then to long, which is the result of the macro.


Taken from Win7 x64:

ExGetCurrentProcessorCpuUsage proc near
mov rdx, gs:20h               GS = KPCR, +0x020 CurrentPrcb : Ptr64 _KPRCB
mov rax, [rdx+18h]            rdx = KPRCB, +0x018 IdleThread : Ptr64 _KTHREAD
mov r8d, [rdx+4708h]          rdx = KPRCB, +0x4708 UserTime
add r8d, [rdx+4704h]          rdx = KPRCB, +0x4704 KernelTime
mov eax, [rax+284h]           rax = KTHREAD, +0x284 KernelTime (of idle thread)
xor edx, edx
imul rax, 64h                 KTHREAD.KernelTime * 100 
div r8                        divided by KPRCB.UserTime + KPRCB.KernelTime
mov edx, 64h
sub edx, eax                  subtract the result of the division from 100
mov [rcx], edx                store result in [rcx]
ExGetCurrentProcessorCpuUsage endp

This function calculates the percentage of CPU usage time by using the following formula:

    x = (ETHREAD.KernelTime * 100) / (KPRCB.UserTime + KPRCB.KernelTime)

which can be derived from a simple proportion:

    (KPRCB.UserTime + KPRCB.KernelTime) : 100 = ETHREAD.KernelTime : x
       total cpu usage time                      idle thread usage time

if x is the percentage of time the CPU spent in the idle thread, the returned value is 100 - x (time spent outside idle, that is: time during which the CPU was busy).
The XP x86 version works in a similar way.


Taken from Win7 x64:

KeGetCurrentIrql proc near
mov rax, cr8
KeGetCurrentIrql endp

The api simply gets the current IRQL from register CR8, which on x64 is a shadow register for the Task Priority Register (TPR) from the LAPIC chip.

The same api taken from Xp x86 shows a slightly different implementation:

806d12e8 a18000feff      mov     eax,dword ptr ds:[FFFE0080h]
806d12ed c1e804          shr     eax,4
806d12f0 0fb68088c06d80  movzx   eax,byte ptr hal!HalpVectorToIRQL (806dc088)[eax]
806d12f7 c3              ret

In this system (as in most systems), the LAPIC data is mapped at address 0xFFFE0000, and offset 0x80 contains the TPR. The api gets the value of such register, which is normally just a byte, discards the lowest 4 bits, and uses the remaining ones as an index into the HalpVectorToIRQL array. Basically, it translates the hardware TPR to a software IRQL. As a test, I can use the debugger to see the value of the TPR:

kd> dd FFFE0080
fffe0080  000000d1 00000000 000000d1 00000000

Removing the lowest 4 bits, the remaining number is 0x0D, which can be used as an index in the HalpVectorToIRQL array:

kd> db HalpVectorToIRQL 
806dc088  00 ff ff 01 02 ff 05 06-07 08 09 0a 1b 1c 1d 1e 

Thus the current IRQL is 0x1C (CLOCK2_LEVEL).


Taken from Win7 x64:

PEPROCESS IoThreadToProcess(_In_  PETHREAD Thread);

IoThreadToProcess proc near
mov rax, [rcx+210h]   ETHREAD.Tcb.Process (pointer to EPROCESS)
IoThreadToProcess endp

The api returns the pointer to EPROCESS that is stored in the ETHREAD whose pointer is passed as an argument.


PsGetThreadProcessId proc near
mov rax, [rcx+3B8h]   ETHREAD.Cid.UniqueProcess
PsGetThreadProcessId endp

The api returns the PID stored in the ETHREAD whose pointer is passed as an argument.


BOOLEAN PsIsSystemThread(_In_  PETHREAD Thread);

PsIsSystemThread proc near
mov eax, [rcx+4Ch]    ETHREAD.Tcb.MiscFlags (contains several flags)
shr eax, 0Dh          0d = 13 = SystemThread  : Pos 13, 1 Bit
and al, 1
PsIsSystemThread endp

This api accesses the MiscFlags field of the KTHREAD structure from the ETHREAD whose pointer is passed as an argument. This field is in the form of a bitfield and contains the following bits/flags:

      +0x04c KernelStackResident : Pos 0, 1 Bit
      +0x04c ReadyTransition  : Pos 1, 1 Bit
      +0x04c ProcessReadyQueue : Pos 2, 1 Bit
      +0x04c WaitNext         : Pos 3, 1 Bit
      +0x04c SystemAffinityActive : Pos 4, 1 Bit
      +0x04c Alertable        : Pos 5, 1 Bit
      +0x04c GdiFlushActive   : Pos 6, 1 Bit
      +0x04c UserStackWalkActive : Pos 7, 1 Bit
      +0x04c ApcInterruptRequest : Pos 8, 1 Bit
      +0x04c ForceDeferSchedule : Pos 9, 1 Bit
      +0x04c QuantumEndMigrate : Pos 10, 1 Bit
      +0x04c UmsDirectedSwitchEnable : Pos 11, 1 Bit
      +0x04c TimerActive      : Pos 12, 1 Bit
      +0x04c SystemThread     : Pos 13, 1 Bit
      +0x04c Reserved         : Pos 14, 18 Bits
      +0x04c MiscFlags        : Int4B

Thus, the flag at position 13 is the one being returned, it corresponds to the "SystemThread" flag.


HANDLE PsGetCurrentThreadId(void);

PsGetCurrentThreadId proc near
mov rax, gs:188h       KPCR.Prcb.CurrentThread (pointer to ETHREAD)
mov rax, [rax+3C0h]    CurrentThread.UniqueThread
PsGetCurrentThreadId endp

The api gets the pointer to the current ETHREAD from the KPCR structure, from which it returns the TID.
The next apis are similar to this one.


PsGetCurrentThreadPreviousMode proc near 
mov rax, gs:188h       KPCR.Prcb.CurrentThread
mov al, [rax+1F6h]     CurrentThread.PreviousMode
PsGetCurrentThreadPreviousMode endp


PsGetCurrentThreadProcessId proc near   
mov rax, gs:188h       KPCR.Prcb.CurrentThread
mov rax, [rax+3B8h]    CurrentThread.Cid.UniqueProcess
PsGetCurrentThreadProcessId endp


PsGetCurrentThreadStackBase proc near
mov rax, gs:188h       KPCR.Prcb.CurrentThread
mov rax, [rax+278h]    CurrentThread.StackBase
PsGetCurrentThreadStackBase endp


PsGetCurrentThreadWin32Thread proc near
mov rax, gs:188h       KPCR.Prcb.CurrentThread
mov rax, [rax+270h]    CurrentThread.Win32Thread
PsGetCurrentThreadWin32Thread endp


PsGetThreadId proc near
mov rax, [rcx+3C0h]    ETHREAD.UniqueThread
PsGetThreadId endp

This api returns the TID from the ETHREAD whose pointer is passed as a parameter.


PsGetThreadSessionId proc near
mov rcx, [rcx+210h]   ETHREAD.Tcb.Process   (points to a KPROCESS)
jmp MmGetSessionId
PsGetThreadSessionId endp

MmGetSessionId proc near
mov rax, [rcx+2D8h]   KPROCESS.Session
xor r8d, r8d
or edx, 0FFFFFFFFh
cmp rax, r8           KPROCESS.Session == 0 ?
jz short loc_140036959
cmp rcx, cs:PsInitialSystemProcess  is the KPROCESS the InitialSystemProcess?
jz short loc_140036959
mov eax, [rax+8]      Session points to a MM_SESSION_SPACE structure, Session+8 is MM_SESSION_SPACE.SessionId

cmp eax, edx
cmovz eax, r8d  if (eax == edx) return 0 else return eax
mov eax, edx
jmp short loc_140036952
MmGetSessionId endp

The api gets the KPROCESS from the ETHREAD whose pointer is passed as  a parameter, then jumps to MmSessionId, which returns the session ID of the current thread. However, if the current thread is the initial system process, or if its pointer to the MM_SESSION_SPACE is NULL, then it returns 0.


PsIsSystemProcess proc near
cmp rcx, cs:PsInitialSystemProcess (this is a pointer to an EPROCESS)
setz al
PsIsSystemProcess endp

The api compares the EPROCESS whose pointer is passed as a parameter with the EPROCESS pointer stored into the variable PsInitialSystemProcess, to verify if they are equal or not.


PsGetProcessImageFileName proc near
lea rax, [rcx+2E0h]    EPROCESS.ImageFileName
PsGetProcessImageFileName endp

The api returns the address of the name of the file used to create the process, taken from the EPROCESS whose pointer is passed as an input parameter.


The offsets of the members in the system structures vary across different versions of Windows (and service packs too). In order to locate the members without hardcoding each possible offset, one possible solution would be to locate fields whose value is known, and then use a relative offset to reach nearby ones. For example, in the case of the EPROCESS structure, the ActiveProcessLinks field has the following offsets in different Windows versions (the list is not exhaustive):

  (all 32 bit)
  xp:     +0x88
  vista:  +0xA0
  2003:   +0x98
  7:      +0xb8

However, for all these versions, the UniqueProcessId field is always the one that precedes ActiveProcessLinks, and this property can be leveraged to programmatically locate the ActiveProcessLinks without knowing its offset. In fact, knowing the value of a PID related to an EPROCESS, it is possible to search the entire memory buffer of the EPROCESS structure, looking for the DWORD that contains the known PID value. Once the PID is found, ActiveProcessLinks is located at the following DWORD.
Another solution could make use of the kernel APIs: in question 11 I have described various APIs that simply retrieve data from system structures. A rootkit could inspect the code of such APIs by disassembling their instructions, and locating the offsets that they use to extract fields from structures. For example, if we consider the opcodes of the following API:

PsGetCurrentThreadProcessId proc near
65 48 8B 04 25 88 01 00 00   mov    rax, gs:188h KPCR.Prcb.CurrentThread
48 8B 80 B8 03 00 00         mov    rax, [rax+3B8h] CurrentThread.Cid.UniqueProcess
C3                           retn
PsGetCurrentThreadProcessId endp

a rootkit could parse them and retrieve the offset of the Cid.UniqueProcess field from the second instruction.


If you use MmGetPhysicalAddress to resolve a virtual address to a physical address, you have no guarantee that the physical address will still be valid afterwards. For example, if the physical address corresponds to a virtual address in the paged pool, then the data at the physical memory page may be swapped to disk in any moment. This means that the physical page will be filled with other data used by e.g. some other process, thus the driver that called MmGetPhysicalAddress would end up reading the wrong data from physical memory.
In order to avoid these problems, a driver should first build a MDL describing the Virtual memory address that it wants to query, then it should use the MmProbeAndLockPages API that makes the page resident in case it was swapped out, probes it for valid r/w access and locks it in memory so that it won't be paged out (the API should be called from within a try/except statement). At this point, the driver can call MmGetPhysicalAddress and it can be sure that the data in the physical memory corresponds indeed to the the virtual address, and that it will not be accidentally swapped out.


This exercise consists in setting up test-signign on a 32 bit and a 64 bit machine. I did not dedicate time to this, since it is a quite tedious task which seems more oriented to installing and configuring rather than reverse engineering :( Besides, the whole procedure is described here:
Thus, I am going to skip it for now. In my tests I simply used Windows' test mode, and I signed my drivers with a dummy signature using the tool "Driver Signature Enforcement Overrider" (available from http://www.ngohq.com/?page=dseo).


RtlImageNtHeader: takes in input the virtual address of an executable module, and returns a pointer to the corresponding PE structure. First, the API validates the input parameters by checking that it's not 0 or -1, and that it points to the signature "MZ" (that is, the signature of a IMAGE_DOS_HEADER structure). After that, the API gets the "elfanew" member of the IMAGE_DOS_HEADER structure (that is the offset to the PE header starting from the imagebase) and composes the virtual address of the possible PE file by adding the "e_lfanew" to the imagebase. The code then evaluates two cases. In the first case, if the address of the PE structure is a kernelmode one (that is, if it is bigger than MmHighestUserAddress), then it simply verifies that it points to the signature "PE\0\0" (the signature of a IMAGE_NT_HEADERS structure) and returns it (or it returns 0 otherwise). In the second case, if the address of the PE structure is in usermode, then it verifies that the whole PE header is contained within the usermode (that is, imagebase + sizeof(IMAGE_DOS_HEADER) + sizeof(IMAGE_NT_HEADERS) < MmHighestUserAddress) and returns its address (or it returns 0 otherwise).

RtlImageDirectoryEntryToData: begins by checking the least significant bit in the ImageBase parameter: if it is set, then the variable "MappedAsImage" is set to 0, and the bit itself is cleared (normally, an ImageBase is supposed to be aligned on a page boundary, that is, to have the least significant 12 bits set to zero). The API then obtains the virtual address of the PE header given its ImageBase by calling the function RtlImageNtHeader, it checks the OptionalHeader.Magic field to determine if the image is a 32 bit or 64 bit one, and calls either _RtlpImageDirectoryEntryToData32 or _RtlpImageDirectoryEntryToData64, returning its return value (the function _RtlpImageDirectoryEntryToData64 operates in the same way as the 32 bit counterpart, thus it won't be discussed).
The _RtlpImageDirectoryEntryToData32 function verifies that the input DataDirectory parameter is less than OptionalHeader.NumberOfRvaAndSizes, or else it returns zero. It then accesses OptionalHeader.DataDirectory[DataDirectory].RVA, verifies that it is non-zero, and then it verifies, in case the input PE file is a usermode one, that the address of the obtained DataDirectory entry is still inside usermode. If the MappedAsImage parameter is nonzero, then the virtual address (obtained adding the RVA and the ImageBase) of the data directory is returned, along with its size stored in the "Size" parameter. However, if MappesAsImage is zero, there is some more processing that needs to be done. If the RVA of the data directory is within the size of the headers (OptionalHeader.SizeOfHeaders), then it is converted to VA (again by adding to it the ImageBase) and it is returned. If it is bigger, the function needs to call _RtlAddressInSectionTable to convert the RVA to VA. This step is required because if the PE file being examined is not mapped as an image, then its sections are not being mapped according to the OptionalHeader.SectionAlignment, but they will be aligned to OptionalHeader.FileAlignment (that is, they are not aligned to virtual memory page boundaries, but they are aligned to the file offset alignment which is in general smaller than the virtual one). What the _RtlAddressInSectionTable function does, in fact, is to retrieve the SectionHeader that contains a particular virtual address (through the function _RtlSectionTableFromVirtualAddress), and then use the following formula:

  SectionHeader.PointerToRawData - SectionHeader.VirtualAddress + ImageBase + RVA

to convert the RVA to a file offset.

I was not able to find the AuxKlibGetImageExportDirectory API. I searched in my machines, but it does not seem to be part of the standard drivers installation on XP or Windows 7. I tried to locate the driver that exports it to see if I could download it, but I have not found it yet. As soon as I find such driver I will analyze this API.


To track the life and death of processes I would set a callback with PsSetCreateProcessNotifyRoutine in order to be notified whenever a process is created or terminated. To track each process I would use the ProcessIDs, which uniquely identify them. If I need more information (e.g. the image name of the process) I would retrieve them from the EPROCESS structure.


The page directory is stored in the DirectoryTableBase field of the EPROCESS structure:

+0x028 DirectoryTableBase : 0x3db1f000