Windows Kernel Internals II
Processes, Threads,
      Virtual Memory
  University of Tokyo – July 2004*
        Dave Probert, Ph.D.
  Advanced Operating Systems Group
Windows Core Operating Systems Division
         Microsoft Corporation
           © Microsoft Corporation 2004   1
               Windows Architecture
User-mode
  Applications
  Subsystem servers | DLLs | System Services | Login/GINA
  Kernel32 | Critical services | User32 / GDI
  ntdll / run-time library
Kernel-mode
  Trap interface / LPC
  Security refmon | IO Manager | Virtual memory | Procs & threads | Win32 GUI
  File filters, File systems, Volume mgrs, Device stacks
  FS run-time, Cache mgr
  Scheduler, exec synchr
  Object Manager / Configuration Management
  Kernel run-time / Hardware Adaptation Layer
                     Process
Container for an address space and threads
Associated User-mode Process Environment Block (PEB)
Primary Access Token
Quota, Debug port, Handle Table etc
Unique process ID
Queued to the Job, global process list and Session list
MM structures like the WorkingSet, VAD tree, AWE etc
                      Thread
Fundamental schedulable entity in the system
Represented by ETHREAD that includes a KTHREAD
Queued to the process (both E and K thread)
IRP list
Impersonation Access Token
Unique thread ID
Associated User-mode Thread Environment Block (TEB)
User-mode stack
Kernel-mode stack
Processor Control Block (in KTHREAD) for cpu state when
  not running
                             Job
Container for multiple processes
Queued to global job list, processes and jobs in the job set
Security token filters and job token
Completion ports
Counters, limits etc
          Process/Thread structure
[Figure: Any Handle Table; Object Manager; Process Object with its
 Threads; Process’ Handle Table (Files, Events, Devices, Drivers);
 Virtual Address Descriptors]
              KPROCESS fields
DISPATCHER_HEADER Header                    KAFFINITY Affinity
ULPTR DirectoryTableBase[2]                 USHORT StackCount
KGDTENTRY LdtDescriptor                     SCHAR BasePriority
KIDTENTRY Int21Descriptor                   SCHAR ThreadQuantum
USHORT IopmOffset                           BOOLEAN AutoAlignment
UCHAR Iopl                                  UCHAR State
volatile KAFFINITY ActiveProcessors         BOOLEAN DisableBoost
ULONG KernelTime                            UCHAR PowerState
ULONG UserTime                              BOOLEAN DisableQuantum
LIST_ENTRY ReadyListHead                    UCHAR IdealNode
SINGLE_LIST_ENTRY SwapListEntry
LIST_ENTRY ThreadListHead
KSPIN_LOCK ProcessLock
              EPROCESS fields
KPROCESS Pcb                           KGUARDED_MUTEX AddressCreationLock
EX_PUSH_LOCK ProcessLock               KSPIN_LOCK HyperSpaceLock
LARGE_INTEGER CreateTime               struct _ETHREAD *ForkInProgress
LARGE_INTEGER ExitTime                 ULONG_PTR HardwareTrigger
EX_RUNDOWN_REF RundownProtect          PMM_AVL_TABLE PhysicalVadRoot
HANDLE UniqueProcessId                 PVOID CloneRoot
LIST_ENTRY ActiveProcessLinks          PFN_NUMBER NumberOfPrivatePages
Quota fields                           PFN_NUMBER NumberOfLockedPages
SIZE_T PeakVirtualSize                 PVOID Win32Process
SIZE_T VirtualSize                     struct _EJOB *Job
LIST_ENTRY SessionProcessLinks         PVOID SectionObject
PVOID DebugPort                        PVOID SectionBaseAddress
PVOID ExceptionPort                    PEPROCESS_QUOTA_BLOCK QuotaBlock
PHANDLE_TABLE ObjectTable
EX_FAST_REF Token
PFN_NUMBER WorkingSetPage
          EPROCESS fields cont.
PPAGEFAULT_HISTORY WorkingSetWatch           PVOID AweInfo
HANDLE Win32WindowStation                    MMSUPPORT Vm
HANDLE InheritedFromUniqueProcessId          Process Flags
PVOID LdtInformation                         NTSTATUS ExitStatus
PVOID VadFreeHint                            UCHAR PriorityClass
PVOID VdmObjects                             MM_AVL_TABLE VadRoot
PVOID DeviceMap
PVOID Session
UCHAR ImageFileName[ 16 ]
LIST_ENTRY JobLinks
PVOID LockedPagesList
LIST_ENTRY ThreadListHead
ULONG ActiveThreads
PPEB Peb
IO Counters
               KTHREAD fields
DISPATCHER_HEADER Header              UCHAR EnableStackSwap
LIST_ENTRY MutantListHead             volatile UCHAR SwapBusy
PVOID InitialStack, StackLimit        LIST_ENTRY WaitListEntry
PVOID KernelStack                     NEXT SwapListEntry
KSPIN_LOCK ThreadLock                 PRKQUEUE Queue
ULONG ContextSwitches                 ULONG WaitTime
volatile UCHAR State                  SHORT KernelApcDisable
KIRQL WaitIrql                        SHORT SpecialApcDisable
KPROC_MODE WaitMode                   KTIMER Timer
PVOID Teb                             KWAIT_BLOCK WaitBlock[N+1]
KAPC_STATE ApcState                   LIST_ENTRY QueueListEntry
KSPIN_LOCK ApcQueueLock               UCHAR ApcStateIndex
LONG_PTR WaitStatus                   BOOLEAN ApcQueueable
PRKWAIT_BLOCK WaitBlockList           BOOLEAN Preempted
BOOLEAN Alertable, WaitNext           BOOLEAN ProcessReadyQueue
UCHAR WaitReason                      BOOLEAN KernelStackResident
SCHAR Priority
         KTHREAD fields cont.
UCHAR IdealProcessor                  PKTRAP_FRAME TrapFrame
volatile UCHAR NextProcessor          ULONG KernelTime, UserTime
SCHAR BasePriority                    PVOID StackBase
SCHAR PriorityDecrement               KAPC SuspendApc
SCHAR Quantum                         KSEMAPHORE SuspendSema
BOOLEAN SystemAffinityActive          PVOID TlsArray
CCHAR PreviousMode                    LIST_ENTRY ThreadListEntry
UCHAR ResourceIndex                   UCHAR LargeStack
UCHAR DisableBoost                    UCHAR PowerState
KAFFINITY UserAffinity                UCHAR Iopl
PKPROCESS Process                     CCHAR FreezeCnt, SuspendCnt
KAFFINITY Affinity                    UCHAR UserIdealProc
PVOID ServiceTable                    volatile UCHAR DeferredProc
PKAPC_STATE ApcStatePtr[2]            UCHAR AdjustReason
KAPC_STATE SavedApcState              SCHAR AdjustIncrement
PVOID CallbackStack
PVOID Win32Thread
ETHREAD fields
  KTHREAD tcb
  Timestamps
  LPC locks and links
  CLIENT_ID Cid
  ImpersonationInfo
  IrpList
  pProcess
  StartAddress
  Win32StartAddress
  ThreadListEntry
  RundownProtect
  ThreadPushLock
     Process Synchronization
ProcessLock – Protects thread list, token
RundownProtect – Cross process address space,
  image section and handle table references
Token, Prefetch – Uses fast referencing
Token, Job – Torn down at last process
  dereference without synchronization
      Thread scheduling states
[Figure: thread scheduling state diagram]
      Thread scheduling states
• Main quasi-states:
   – Ready – able to run
   – Running – current thread on a processor
   – Waiting – waiting for an event
• For scalability Ready is three real states:
   – DeferredReady – queued on any processor
   – Standby – will imminently start Running
   – Ready – queued on target processor by priority
• Goal is granular locking of thread priority
  queues
• Red states relate to swapped stacks and
  processes
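
The split of Ready into three real states can be sketched as a small mapping. This is an illustration only: the enum and the function name `quasi_state` are hypothetical, and only the state names come from the slide.

```python
from enum import Enum


class ThreadState(Enum):
    """Real scheduler states named on the slide (not the kernel's own enum)."""
    DEFERRED_READY = "DeferredReady"  # queued on any processor
    STANDBY = "Standby"               # will imminently start Running
    READY = "Ready"                   # queued on target processor by priority
    RUNNING = "Running"
    WAITING = "Waiting"


# The three real states that together form the "Ready" quasi-state.
READY_STATES = frozenset(
    {ThreadState.DEFERRED_READY, ThreadState.STANDBY, ThreadState.READY}
)


def quasi_state(state):
    """Collapse a real state into one of the three main quasi-states."""
    return "Ready" if state in READY_STATES else state.value
```

For example, `quasi_state(ThreadState.STANDBY)` reports the thread as Ready even though it already has a processor selected.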
              Process Lifetime
Created as an empty shell
Address space created with only ntdll and the main image
  unless forked
Handle table created empty or populated via duplication
  from parent
Process is partially destroyed on last thread exit
Process totally destroyed on last dereference
                Thread Lifetime
Created within a process with a CONTEXT record
Starts running in the kernel but has a trap frame to return to
  user mode
Kernel queues user APC to do ntdll initialization
Terminated by a thread calling NtTerminateThread/Process
Summary: Native NT Process APIs
NtCreateProcess()                 NtCreateThread()
NtTerminateProcess()              NtTerminateThread()
NtQueryInformationProcess()       NtSuspendThread()
NtSetInformationProcess()         NtResumeThread()
NtGetNextProcess()                NtGetContextThread()
NtGetNextThread()                 NtSetContextThread()
NtSuspendProcess()                NtQueryInformationThread()
NtResumeProcess()                 NtSetInformationThread()
                                  NtAlertThread()
                                  NtQueueApcThread()
        Virtual Memory Manager
                       Features
Provides 4 GB flat virtual address space (IA32)
Manages process address space
Handles page faults
Manages process working sets
Manages physical memory
Provides memory-mapped files
Allows pages to be shared between processes
Facilities for I/O subsystem and device drivers
Supports file system cache manager
        Virtual Memory Manager
                    NT Internal APIs
NtCreatePagingFile
NtAllocateVirtualMemory (Proc, Addr, Size, Type, Prot)
    Proc: handle to a process
    Type: COMMIT, RESERVE, PHYSICAL, TOP_DOWN,
       RESET, LARGE_PAGES, WRITE_WATCH
    Prot: NOACCESS, EXECUTE, READONLY,
       READWRITE, NOCACHE
NtFreeVirtualMemory (Proc, Addr, Size, FreeType)
    FreeType: DECOMMIT or RELEASE
NtQueryVirtualMemory
NtProtectVirtualMemory
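
How the RESERVE/COMMIT flags and the DECOMMIT/RELEASE free types move an address range between states can be sketched with a toy model. The class `Region` and its methods are hypothetical and deliberately simplified (one range, no sizes, no protection); this is not the NT API itself.

```python
class Region:
    """Toy model of one virtual address range: free, reserved, or committed."""

    def __init__(self):
        self.state = "free"

    def allocate(self, flags):
        """flags: a set of flag names from the slide, e.g. {"RESERVE", "COMMIT"}."""
        if "RESERVE" in flags:
            if self.state != "free":
                raise ValueError("range is already reserved")
            self.state = "reserved"
        if "COMMIT" in flags:
            # Simplification: this model requires an explicit reservation first.
            if self.state == "free":
                raise ValueError("cannot commit a range that is not reserved")
            self.state = "committed"

    def free(self, free_type):
        if self.state == "free":
            raise ValueError("range is not allocated")
        if free_type == "DECOMMIT":
            self.state = "reserved"   # backing storage goes, address range stays
        elif free_type == "RELEASE":
            self.state = "free"       # address range is given back as well
        else:
            raise ValueError("FreeType must be DECOMMIT or RELEASE")
```

The point of the two-step design is that reserving address space is cheap, while committing is what charges against storage.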
       Virtual Memory Manager
                 NT Internal APIs
Pagefault
NtLockVirtualMemory, NtUnlockVirtualMemory
   – locks a region of pages within the working set list
   – requires PROCESS_VM_OPERATION on target
     process and SeLockMemoryPrivilege
NtReadVirtualMemory, NtWriteVirtualMemory (Proc, Addr, Buffer, Size)
NtFlushVirtualMemory
       Virtual Memory Manager
                 NT Internal APIs
NtCreateSection
   – creates a section but does not map it
NtOpenSection
   – opens an existing section
NtQuerySection
   – query attributes for section
NtExtendSection
NtMapViewOfSection (Sect, Proc, Addr, Size, …)
NtUnmapViewOfSection
          Virtual Memory Manager
                   NT Internal APIs
APIs to support AWE (Address Windowing Extensions)
   – Private memory only
   – Map only in current process
   – Requires LOCK_VM privilege
NtAllocateUserPhysicalPages (Proc, NPages, &PFNs[])
NtMapUserPhysicalPages (Addr, NPages, PFNs[])
NtMapUserPhysicalPagesScatter
NtFreeUserPhysicalPages (Proc, &NPages, PFNs[])
NtResetWriteWatch
NtGetWriteWatch
   – reads out dirty bits for a section of memory since the last
     reset
    Allocating kernel memory (pool)
•    Tightest x86 system resource is KVA
      Kernel Virtual Address space
•    Pool allocates in small chunks:
      < 4KB: 8B granularity
      >= 4KB: page granularity
•    Paged and Non-paged pool
      Paged pool backed by pagefile
•    Special pool used to find corruptors
•    Lots of support for debugging/diagnosis
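
The two granularities can be illustrated with a small rounding helper. This is a sketch under stated assumptions: the function name `pool_rounded_size` is hypothetical, and pool header overhead is ignored.

```python
PAGE_SIZE = 4096         # x86 page size in bytes
SMALL_GRANULARITY = 8    # granularity for requests below one page


def pool_rounded_size(nbytes):
    """Round a request up to the granularity the slide gives for its size class."""
    if nbytes <= 0:
        raise ValueError("size must be positive")
    granularity = SMALL_GRANULARITY if nbytes < PAGE_SIZE else PAGE_SIZE
    return (nbytes + granularity - 1) // granularity * granularity
```

Under this model a 13-byte request occupies 16 bytes, while a request of 4097 bytes occupies two full pages.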
      x86 kernel virtual address space
80000000  System code, initial non-paged pool
A0000000  Session space (win32k.sys)
A4000000  Sysptes overflow, cache overflow
C0000000  Page directory self-map and page tables
C0400000  Hyperspace (e.g. working set list)
C0800000  Unused – no access
C0C00000  System working set list
C1000000  System cache
E1000000  Paged pool
E8000000  Reusable system VA (sysptes)
          Non-paged pool expansion
FFBE0000  Crash dump information
FFC00000  HAL usage
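
Given the start addresses in this layout, finding the region that holds a kernel address is a sorted lookup. The sketch below copies the boundaries from the slide; the names `REGIONS` and `region_of` are hypothetical.

```python
import bisect

# (start address, region name), taken from the x86 layout on the slide.
REGIONS = [
    (0x80000000, "System code, initial non-paged pool"),
    (0xA0000000, "Session space (win32k.sys)"),
    (0xA4000000, "Sysptes overflow, cache overflow"),
    (0xC0000000, "Page directory self-map and page tables"),
    (0xC0400000, "Hyperspace (e.g. working set list)"),
    (0xC0800000, "Unused - no access"),
    (0xC0C00000, "System working set list"),
    (0xC1000000, "System cache"),
    (0xE1000000, "Paged pool"),
    (0xE8000000, "Reusable system VA (sysptes), non-paged pool expansion"),
    (0xFFBE0000, "Crash dump information"),
    (0xFFC00000, "HAL usage"),
]
STARTS = [start for start, _ in REGIONS]


def region_of(va):
    """Return the name of the kernel region containing virtual address va."""
    if not 0x80000000 <= va <= 0xFFFFFFFF:
        raise ValueError("not an x86 kernel-mode address")
    # Last region whose start address is <= va.
    return REGIONS[bisect.bisect_right(STARTS, va) - 1][1]
```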
            Valid x86 Hardware PTEs
Bits 31-12  Pageframe
Bits 11-9   Reserved (R R R)
Bit  8      Global (G)
Bit  7      Reserved (R)
Bit  6      Dirty (D)
Bit  5      Accessed (A)
Bit  4      Cache disabled (Cd)
Bit  3      Write through (Wt)
Bit  2      Owner (O)
Bit  1      Write (W)
Bit  0      Valid (1)
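
Decoding a valid PTE with this bit layout is a matter of shifts and masks. The following is a minimal sketch; the helper name `decode_valid_pte` and the flag table are illustrative, with bit positions read off the slide.

```python
# Bit position -> flag name, per the slide's layout (reserved bits omitted).
FLAG_BITS = {
    0: "Valid",
    1: "Write",
    2: "Owner",
    3: "Write through",
    4: "Cache disabled",
    5: "Accessed",
    6: "Dirty",
    8: "Global",
}


def decode_valid_pte(pte):
    """Return (page frame number, list of set flag names) for a valid PTE."""
    if not pte & 1:
        raise ValueError("bit 0 is clear: not a valid hardware PTE")
    flags = [name for bit, name in sorted(FLAG_BITS.items()) if pte & (1 << bit)]
    return (pte >> 12) & 0xFFFFF, flags
```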
       Virtual Address Translation
CR3 -> PD (1024 PDEs) -> PT (1024 PTEs) -> page (4096 bytes) -> DATA
Virtual address bits: 0000 0000 0000 0000 0000 0000 0000 0000
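
With 1024 PDEs, 1024 PTEs, and 4096-byte pages, a 32-bit virtual address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit byte offset. A minimal sketch, with `split_va` as a hypothetical helper name:

```python
def split_va(va):
    """Split a 32-bit x86 virtual address into (PD index, PT index, byte offset)."""
    va &= 0xFFFFFFFF
    pd_index = (va >> 22) & 0x3FF   # top 10 bits select one of 1024 PDEs
    pt_index = (va >> 12) & 0x3FF   # next 10 bits select one of 1024 PTEs
    offset = va & 0xFFF             # low 12 bits index into the 4096-byte page
    return pd_index, pt_index, offset
```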
          Self-mapping page tables
•   Page Table Entries (PTEs) and Page Directory Entries
    (PDEs) contain Physical Frame Numbers (PFNs)
    – But Kernel runs with Virtual Addresses
•   To access PDE/PTE from kernel use the self-map
    for the current process:
    PageDirectory[0x300] uses PageDirectory as PageTable
    – GetPdeAddress(va): 0xc0300000 + ((va >> 22) << 2)
    – GetPteAddress(va): 0xc0000000 + ((va >> 12) << 2)
•   PDE/PTE formats are compatible!
•   Access another process VA via thread ‘attach’
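
The self-map arithmetic can be checked with a short sketch. The base addresses come from the slide; the function names mirror the slide's GetPdeAddress and GetPteAddress, but this Python rendering and the 4-byte entry size constant are illustrative, not kernel code.

```python
PTE_BASE = 0xC0000000   # page tables appear here through the self-map
PDE_BASE = 0xC0300000   # the page directory itself, seen as a page table
ENTRY_SIZE = 4          # bytes per PDE/PTE on non-PAE x86


def get_pde_address(va):
    """Virtual address of the PDE that maps va."""
    return PDE_BASE + ((va & 0xFFFFFFFF) >> 22) * ENTRY_SIZE


def get_pte_address(va):
    """Virtual address of the PTE that maps va."""
    return PTE_BASE + ((va & 0xFFFFFFFF) >> 12) * ENTRY_SIZE
```

Note that `get_pte_address(PTE_BASE)` equals `PDE_BASE`: the "page table" that maps the page-table region is the page directory itself, which is why the PDE and PTE formats need to be compatible.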
    Self-mapping page tables
Virtual Access to PageDirectory[0x300]
    CR3 -> PD, entry 0x300 -> PD (used as page table), entry 0x300 -> PTE
    Phys: PD[0xc0300000>>22] = PD
    Virt: *(0xc0300c00) == PD
    0xc0300c00 = 1100 0000 0011 0000 0000 1100 0000 0000
            Self-mapping page tables
           Virtual Access to PTE for va 0xe4321000
    GetPteAddress: 0xe4321000 => 0xc0390c84
    CR3 -> PD, entry 0x300 -> PD (used as page table), entry 0x390
        -> PT, entry 0x321 -> PTE
    0xc0390c84 = 1100 0000 0011 1001 0000 1100 1000 0100
               x86 Invalid PTEs
Page file PTE
  Bits 31-12  Page file offset
  Bit  11     Transition (0)
  Bit  10     Prototype
  Bits 9-5    Protection
  Bits 4-1    PFN
  Bit  0      0
Transition PTE
  Bits 31-12  Page file offset
  Bit  11     Transition (1)
  Bit  10     Prototype
  Bits 9-5    Protection
  Bits 4-1    HW ctrl: Cache disable, Write through, Owner, Write
  Bit  0      0
                x86 Invalid PTEs
Demand zero: Page file PTE with zero offset and PFN
Unknown: PTE is completely zero or Page Table doesn’t exist yet.
  Examine VADs.
Pointer to Prototype PTE
    pPte bits 7-27                      pPte bits 0-6      0
   31           12 11 10 9 8 7                      5 4   1 0
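
The invalid PTE kinds on these two slides can be told apart by a few bit tests. The sketch below assumes the bit positions as read from the layout figures (valid in bit 0, prototype in bit 10, transition in bit 11, offset in bits 31-12, PFN in bits 4-1); the function name `classify_pte` is hypothetical.

```python
VALID_BIT = 1 << 0
PROTOTYPE_BIT = 1 << 10
TRANSITION_BIT = 1 << 11


def classify_pte(pte):
    """Name the PTE kind, assuming the bit layout shown on the slides."""
    if pte & VALID_BIT:
        return "valid"
    if pte == 0:
        return "unknown"            # completely zero: examine the VADs
    if pte & PROTOTYPE_BIT:
        return "prototype"          # checked first: pointer bits may overlap bit 11
    if pte & TRANSITION_BIT:
        return "transition"
    offset = pte >> 12
    pfn = (pte >> 1) & 0xF
    if offset == 0 and pfn == 0:
        return "demand zero"        # page file PTE with zero offset and PFN
    return "page file"
```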
            Prototype PTEs
• Kept in array in the segment structure
  associated with section objects
• Six PTE states:
  – Active/valid
  – Transition
  – Modified-no-write
  – Demand zero
  – Page file
  – Mapped file
Shared Memory Data Structures
       Physical Memory Management
[Figure: Physical Page State Changes among the Process/System
 Working Set, Standby List, Modified List, Free List and Zero List.
 Labeled transitions: Trim Clean, Trim Dirty, Soft Fault, Modified
 Page-writer, Delete Page, MM Low Memory, Zero Thread,
 Hardfault (DISK), Zerofault (FILL)]
                Paging Overview
Working Sets: list of valid pages for each process
  (and the kernel)
Pages ‘trimmed’ from a working set go onto lists:
  Standby list: pages backed by disk
  Modified list: dirty pages to push to disk
  Free list: pages not associated with disk
  Zero list: supply of demand-zero pages
Modified/standby pages can be faulted back into a
  working set w/o disk activity (soft fault)
Background system threads trim working sets,
  write modified pages and produce zero pages
  based on memory state and config parameters
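
The difference between a soft fault and a hard fault can be sketched with a toy model of the working set plus the standby and modified lists. The class `PageLists` and its methods are hypothetical, and the free and zero lists are left out for brevity.

```python
class PageLists:
    """Toy model: pages move between a working set and two trim lists."""

    def __init__(self):
        self.working_set = set()
        self.standby = set()    # trimmed pages backed by disk
        self.modified = set()   # trimmed dirty pages still to be written

    def trim(self, page, dirty):
        """Remove a page from the working set onto the matching list."""
        self.working_set.remove(page)
        (self.modified if dirty else self.standby).add(page)

    def write_modified(self):
        """Modified page writer: written pages become standby pages."""
        self.standby |= self.modified
        self.modified.clear()

    def fault(self, page):
        """Bring a page into the working set; report the kind of fault."""
        if page in self.standby or page in self.modified:
            self.standby.discard(page)
            self.modified.discard(page)
            kind = "soft"       # still in memory: no disk activity needed
        else:
            kind = "hard"       # simplification: zero faults are not modeled
        self.working_set.add(page)
        return kind
```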
         Managing Working Sets
Aging pages: Increment age counts for pages
  which haven't been accessed
Estimate unused pages: count in working set and
  keep a global count of the estimate
When getting tight on memory: replace rather
  than add pages when a fault occurs in a working
  set with significant unused pages
When memory is tight: reduce (trim) working sets
  which are above their maximum
Balance Set Manager: periodically runs Working
  Set Trimmer, also swaps out kernel stacks of
  long-waiting threads
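
Aging and trimming can be sketched as follows. This is an illustration under assumptions not stated on the slide: an accessed page has its age reset and its accessed flag cleared on each pass, and a page counts as unused once its age reaches a threshold (`min_age`, a made-up parameter). All function names are hypothetical.

```python
def age_working_set(pages):
    """One aging pass. pages maps page -> {"accessed": bool, "age": int}."""
    for entry in pages.values():
        if entry["accessed"]:
            entry["age"] = 0            # assumption: access resets the age
            entry["accessed"] = False   # assumption: flag cleared per pass
        else:
            entry["age"] += 1           # not accessed since the last pass


def estimate_unused(pages, min_age=3):
    """Count pages old enough to be treated as unused."""
    return sum(1 for entry in pages.values() if entry["age"] >= min_age)


def trim_candidates(pages, count):
    """Pick the oldest pages first when the working set must shrink."""
    return sorted(pages, key=lambda p: pages[p]["age"], reverse=True)[:count]
```

A trimmer built on this would call `age_working_set` periodically and consult `trim_candidates` only when memory gets tight.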
Discussion