Practical Malware Analysis
Ch 13: Data Encoding
                             Revised 4-25-16
The Goal of Analyzing
 Encoding Algorithms
Reasons Malware Uses Encoding
• Hide configuration information
  – Such as C&C domains
• Save information to a staging file
  – Before stealing it
• Store strings needed by malware
  – Decode them just before they are needed
• Disguise malware as a legitimate tool
  – Hide suspicious strings
Simple Ciphers
     Why Use Simple Ciphers?
• They are easily broken, but
  – They are small, so they fit into space-
    constrained environments like exploit
    shellcode
  – Less obvious than more complex ciphers
  – Low overhead, little impact on performance
• These are obfuscation, not encryption
  – They make it difficult to recognize the data,
    but can't stop a skilled analyst
            Caesar Cipher
• Move each letter forward 3 spaces in the
  alphabet
   ABCDEFGHIJKLMNOPQRSTUVWXYZ
   DEFGHIJKLMNOPQRSTUVWXYZABC
• Example
   ATTACK AT NOON
   DWWDFN DW QRRQ
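The shift above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book; the function name `caesar` is made up for this example, and it only handles uppercase letters, leaving everything else unchanged.

```python
import string

def caesar(text: str, shift: int = 3) -> str:
    """Shift each uppercase letter forward by `shift`, wrapping Z around to A."""
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(upper, upper[k:] + upper[:k])
    return text.translate(table)

print(caesar("ATTACK AT NOON"))        # DWWDFN DW QRRQ
print(caesar("DWWDFN DW QRRQ", -3))    # ATTACK AT NOON
```

A negative shift undoes the encoding, which is why this counts as obfuscation rather than encryption.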
                    XOR
  0 xor 0 = 0
  0 xor 1 = 1
  1 xor 0 = 1
  1 xor 1 = 0
• Uses a key to encrypt data
• Uses one bit of data and one bit of the
  key at a time
• Example: Encode HI with a key of 0x3c
  HI = 0x48 0x49 (ASCII encoding)
  Data:   0100 1000 0100 1001
  Key:    0011 1100 0011 1100
  Result: 0111 0100 0111 0101
       XOR Reverses Itself
  0 xor 0 = 0
  0 xor 1 = 1
  1 xor 0 = 1
  1 xor 1 = 0
• Example: Encode HI with a key of 0x3c
  HI = 0x48 0x49 (ASCII encoding)
  Data:       0100 1000 0100 1001
  Key:       0011 1100 0011 1100
  Result:    0111 0100 0111 0101
• Encode it again
  Result:    0111 0100 0111 0101
  Key:       0011 1100 0011 1100
  Data:      0100 1000 0100 1001
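The HI example can be checked with a short Python sketch. The helper name `xor_bytes` is an assumption made for this illustration; it applies the same single-byte key to every byte, so running it twice returns the original data.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of `data` with the single-byte `key`."""
    return bytes(b ^ key for b in data)

encoded = xor_bytes(b"HI", 0x3C)
print(encoded.hex())               # 7475
print(xor_bytes(encoded, 0x3C))    # b'HI'
```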
   Brute-Forcing XOR Encoding
• If the key is a single byte, there are only
  256 possible keys
  – Error in book; this should be "a.exe"
  – PE files begin with MZ
MZ = 0x4d 0x5a
Link Ch 13a
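A minimal sketch of the brute-force idea, assuming the encoded file is a PE file so the decoded data must begin with MZ. The function name and the sample bytes are hypothetical; only the first two bytes are tested for each of the 256 keys.

```python
def brute_force_xor(data: bytes, magic: bytes = b"MZ"):
    """Return every single-byte key whose decoding of `data` starts with `magic`."""
    hits = []
    for key in range(256):
        head = bytes(b ^ key for b in data[:len(magic)])
        if head == magic:
            hits.append(key)
    return hits

# Hypothetical sample: the start of a PE header encoded with key 0x12.
sample = bytes(b ^ 0x12 for b in b"MZ\x90\x00\x03\x00")
print([hex(k) for k in brute_force_xor(sample)])   # ['0x12']
```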
     Brute-Forcing Many Files
• Look for a
  common
  string, like
  "This Program"
             XOR and Nulls
• A null byte reveals the key, because
  – 0x00 xor KEY = KEY
• Obviously the key here is 0x12
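Because 0x00 xor KEY = KEY, a file with many null bytes leaks the key as its most frequent byte. The sketch below assumes nulls are the most common byte in the plaintext, which is typical for PE files but not guaranteed; the function name and sample data are made up for this example.

```python
from collections import Counter

def guess_key_from_nulls(data: bytes) -> int:
    """Guess a single-byte XOR key as the most frequent byte in `data`."""
    return Counter(data).most_common(1)[0][0]

# Hypothetical plaintext with runs of nulls, encoded with key 0x12.
plain = b"MZ" + b"\x00" * 30 + b"PE\x00\x00"
encoded = bytes(b ^ 0x12 for b in plain)
print(hex(guess_key_from_nulls(encoded)))   # 0x12
```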
   NULL-Preserving Single-Byte XOR
              Encoding
• Algorithm:
  – Use XOR encoding, EXCEPT
  – If the plaintext is NULL or the key itself, skip
    the byte
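The algorithm above can be sketched as follows. The function name is an assumption for this example. Skipping both the null byte and the key byte keeps the scheme reversible: a byte that is XORed can never turn into 0x00 or into the key, so the same function also decodes.

```python
def null_preserving_xor(data: bytes, key: int) -> bytes:
    """Single-byte XOR that leaves 0x00 and the key byte itself untouched."""
    out = bytearray()
    for b in data:
        if b == 0x00 or b == key:
            out.append(b)          # skip: nulls stay nulls, so the key is not revealed
        else:
            out.append(b ^ key)
    return bytes(out)

plain = b"MZ\x00\x00\x12AB"
encoded = null_preserving_xor(plain, 0x12)
print(encoded[2:5])                          # b'\x00\x00\x12'
print(null_preserving_xor(encoded, 0x12))    # b'MZ\x00\x00\x12AB'
```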
Identifying XOR Loops in IDA Pro
• Small loops with an XOR instruction inside
  1. Start in "IDA View" (seeing code)
  2. Click Search, Text
  3. Enter xor and Find all occurrences
         Three Forms of XOR
• XOR a register with itself, like xor edx, edx
  – Innocent, a common way to zero a register
• XOR a register or memory reference with a
  constant
  – May be an encoding loop, and key is the
    constant
• XOR a register or memory reference with a
  different register or memory reference
  – May be an encoding loop, key less obvious
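The three forms can be told apart mechanically when triaging search results. This is a rough sketch that works on disassembly text lines, not an IDA Pro API; the function name, the returned labels, and the operand formats it recognizes (IDA-style `12h`, C-style `0x12`, decimal) are assumptions for illustration.

```python
import re

# Immediate operands: C-style hex, IDA-style hex with an h suffix, or decimal.
IMMEDIATE = re.compile(r"^(0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)

def classify_xor(line: str) -> str:
    """Classify one disassembly line such as 'xor eax, 12h'."""
    parts = line.split(None, 1)
    if len(parts) != 2 or parts[0].lower() != "xor" or "," not in parts[1]:
        return "not an xor instruction"
    dst, src = (p.strip().lower() for p in parts[1].split(",", 1))
    if dst == src:
        return "zeroing"                     # form 1: innocent
    if IMMEDIATE.match(src):
        return "constant key candidate"      # form 2: the constant may be the key
    return "register/memory key candidate"   # form 3: key is less obvious

for text in ("xor edx, edx", "xor eax, 12h", "xor al, bl"):
    print(text, "->", classify_xor(text))
```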
                    Base64
• Converts 6 bits into one character in a 64-
  character alphabet
• There are a few versions, but all use these
  62 characters:
  ABCDEFGHIJKLMNOPQRSTUVWXYZ
  abcdefghijklmnopqrstuvwxyz
  0123456789
• MIME uses + and /
  – Also = to indicate padding
  Transforming Data to Base64
• Use 3-byte chunks (24 bits)
• Break into four 6-bit fields
• Convert each to Base64
base64encode.org
base64decode.org
• 3 bytes encode to 4
  Base64 characters
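The three steps can be written out by hand to show where the four characters come from. This is a teaching sketch using the MIME alphabet; in practice the standard library's `base64` module does the same work. The name `encode_chunk` is made up for this example.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_chunk(chunk: bytes) -> str:
    """Encode exactly 3 bytes (24 bits) as 4 Base64 characters."""
    if len(chunk) != 3:
        raise ValueError("expected a 3-byte chunk")
    n = int.from_bytes(chunk, "big")                             # one 24-bit integer
    fields = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]   # four 6-bit fields
    return "".join(ALPHABET[f] for f in fields)

print(encode_chunk(b"bot"))                   # Ym90
print(base64.b64encode(b"bot").decode())      # Ym90
```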
                 Padding
• If the final chunk has
  only 2 bytes, a single
  = is appended
                 Padding
• If the final chunk has
  only 1 byte, == is
  appended
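The padding rule can be confirmed with Python's standard `base64` module. The helper `padding_for` is a hypothetical name used only here; it computes how many = characters follow from the input length.

```python
import base64

def padding_for(length: int) -> int:
    """Number of '=' characters Base64 appends for `length` input bytes."""
    return (3 - length % 3) % 3

for data in (b"abc", b"ab", b"a"):
    print(data, base64.b64encode(data).decode(), padding_for(len(data)))
# b'abc' YWJj 0
# b'ab' YWI= 1
# b'a' YQ== 2
```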
               Example
• URL and cookie are Base64-encoded
      Cookie: Ym90NTQxNjQ
• This has 11
  characters (not a
  multiple of 4), so
  padding was omitted
• Some Base64
  decoders will fail,
  but this one just
  automatically adds
  the missing padding
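Python's `base64.b64decode` is one of the decoders that rejects input with missing padding, so the = characters have to be restored first. A minimal sketch, with a made-up helper name, applied to the cookie value above:

```python
import base64

def b64decode_unpadded(text: str) -> bytes:
    """Decode Base64 text whose trailing '=' padding may have been dropped."""
    return base64.b64decode(text + "=" * (-len(text) % 4))

print(b64decode_unpadded("Ym90NTQxNjQ"))   # b'bot54164'
```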
   Finding the Base64 Function
• Look for this "indexing string"
  ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghi
  jklmnopqrstuvwxyz0123456789+/
• Look for a lone padding character
  (typically =) hard-coded into the encoding
  function
         Decoding the URLs
• Custom indexing string
  aABCDEFGHIJKLMNOPQRSTUVWXYZbcdefghijk
  lmnopqrstuvwxyz0123456789+/
• Look for a lone padding character (typically
  =) hard-coded into the encoding function
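One way to decode data produced with a custom indexing string is to map each character back to the standard alphabet and then use an ordinary decoder. The sketch below uses the custom string shown above; the function names are assumptions for this example, and it tolerates missing padding.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM = "aABCDEFGHIJKLMNOPQRSTUVWXYZbcdefghijklmnopqrstuvwxyz0123456789+/"

def custom_b64decode(text: str) -> bytes:
    """Translate from the custom alphabet to the standard one, then decode."""
    standard_text = text.translate(str.maketrans(CUSTOM, STANDARD))
    return base64.b64decode(standard_text + "=" * (-len(standard_text) % 4))

def custom_b64encode(data: bytes) -> str:
    """Encode normally, then translate to the custom alphabet (for testing)."""
    return base64.b64encode(data).decode("ascii").translate(str.maketrans(STANDARD, CUSTOM))

token = custom_b64encode(b"bot54164")
print(token, custom_b64decode(token))
```

A standard decoder applied directly to such a token yields garbage, which is a useful hint that the indexing string has been changed.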
Common Cryptographic
    Algorithms
          Strong Cryptography
• Strong enough to resist brute-force attacks
  – Ex: SSL, AES, etc.
• Disadvantages of strong encryption
  – Large cryptographic libraries required
  – May make code less portable
  – Standard cryptographic libraries are easily detected
     • Via function imports, function matching, or identification of
       cryptographic constants
  – Symmetric encryption requires a way to hide the key
Recognizing Strings and Imports
• Strings found in malware encrypted with
  OpenSSL
Recognizing Strings and Imports
• Microsoft crypto functions usually start
  with Crypt or CP or Cert
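A quick filter over a list of import names illustrates the prefix check. This is a sketch only: the import list is hypothetical and would normally come from a PE viewer, and a prefix match can produce false positives, so treat hits as leads rather than proof.

```python
CRYPTO_PREFIXES = ("Crypt", "CP", "Cert")

def crypto_imports(import_names):
    """Return the import names that start with a Microsoft crypto API prefix."""
    return [name for name in import_names if name.startswith(CRYPTO_PREFIXES)]

# Hypothetical import list for illustration.
imports = ["CreateFileA", "CryptAcquireContextA", "CryptEncrypt",
           "CertOpenStore", "WriteFile"]
print(crypto_imports(imports))
# ['CryptAcquireContextA', 'CryptEncrypt', 'CertOpenStore']
```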
Searching for Cryptographic Constants
• IDA Pro's FindCrypt2 Plug-in (Link Ch 13c)
  – Finds magic constants (binary signatures of
    crypto routines)
  – Cannot find RC4 or IDEA routines because
    they don't use a magic constant
  – RC4 is commonly used in malware because it's
    small and easy to implement
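Constant searching can be sketched as a plain byte search. The two signatures below (the first bytes of the AES S-box and the 0x67452301 initialization constant shared by MD5 and SHA-1, stored little-endian) are well known, but this tiny table is only an illustration and does not reproduce FindCrypt2's actual signature list. The function name is made up.

```python
import struct

SIGNATURES = {
    "AES S-box": bytes.fromhex("637c777bf26b6fc5"),
    "MD5/SHA-1 init constant": struct.pack("<I", 0x67452301),
}

def find_crypto_constants(data: bytes):
    """Return (name, offset) for each known constant found in `data`."""
    hits = []
    for name, pattern in SIGNATURES.items():
        offset = data.find(pattern)
        if offset != -1:
            hits.append((name, offset))
    return hits

# Hypothetical blob: 16 filler bytes followed by the start of the AES S-box.
blob = b"\x90" * 16 + bytes.fromhex("637c777bf26b6fc5") + b"\x00" * 4
print(find_crypto_constants(blob))   # [('AES S-box', 16)]
```

RC4 has no such table to search for, which is why this approach misses it.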
              FindCrypt2
• Runs automatically on any new analysis
• Can be run manually from the Plug-In
  Menu
   Krypto ANALyzer (PEiD Plug-in)
• Download from link Ch 13d
• Has wider range of constants than FindCrypt2
  – More false positives
• Also finds Base64 tables and crypto function
  imports
                    Entropy
• Entropy measures disorder
• To calculate it, just count the number of
  occurrences of each byte from 0 to 255
  – Calculate Pi = Probability of value i
  – Then sum -Pi log2(Pi) for i = 0 to 255 (Link 13e)
• If all the bytes are equally likely, the
  entropy is 8 (maximum disorder)
• If all the bytes are the same, the entropy is
  zero
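The calculation can be sketched directly from the definition, using log base 2 so the result is in bits per byte. The function name is an assumption; byte values that never occur contribute nothing to the sum and are skipped.

```python
import math
from collections import Counter

def entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte, from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    # p * log2(1/p) is the same as -p * log2(p)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())

print(entropy(bytes(range(256))))   # 8.0  (all byte values equally likely)
print(entropy(b"A" * 1000))         # 0.0  (all bytes the same)
```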
                          Entropy Demo
      • Put output in a file
      • Use binwalk -E to analyze the file
      • Multiply vertical axis by 8
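#!/usr/bin/env python3
# Writes five regions with different entropy levels to standard output.
import base64, random, sys
a = bytes(random.randint(0, 255) for _ in range(10000))  # random bytes
b = base64.b64encode(a)   # Base64: 64-character alphabet
c = base64.b32encode(a)   # Base32: 32-character alphabet
d = base64.b16encode(a)   # Base16 (hex): 16-character alphabet
e = b'A' * 10000          # one repeated byte
sys.stdout.buffer.write(a + b + c + d + e)  # raw bytes; redirect to a file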
           Entropy Demo
• Concatenate three images in different
  formats
Searching for High-Entropy Content
• IDA Pro Entropy Plugin
• Finds regions of high entropy, indicating
  encryption (or compression)
   Recommended Parameters
• Chunk size: 64        Max. Entropy: 5.95
  – Good for finding many constants
  – Including Base64-encoded strings (entropy 6)
• Chunk size: 256     Max. Entropy: 7.9
  – Finds very random regions
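The effect of the two parameter sets can be sketched with a chunked entropy scan. This is not the plug-in's code: the function names, the use of non-overlapping chunks, and treating the entropy value as a minimum threshold are all assumptions for illustration. Note that a 64-byte chunk can never exceed 6 bits of entropy, which is why 5.95 is close to the ceiling for that chunk size.

```python
import math
from collections import Counter

def entropy(chunk: bytes) -> float:
    """Shannon entropy of a non-empty chunk, in bits per byte."""
    total = len(chunk)
    return sum((n / total) * math.log2(total / n) for n in Counter(chunk).values())

def high_entropy_chunks(data: bytes, chunk_size: int = 64, threshold: float = 5.95):
    """Yield (offset, entropy) for each full chunk at or above `threshold`."""
    for offset in range(0, len(data) - chunk_size + 1, chunk_size):
        h = entropy(data[offset:offset + chunk_size])
        if h >= threshold:
            yield offset, h

# Hypothetical data: nulls, then a Base64 indexing string, then repeated text.
index = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
data = b"\x00" * 64 + index + b"A" * 64
print(list(high_entropy_chunks(data)))
```

With the default parameters only the chunk holding the indexing string is reported.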
             Entropy Graph
• IDA Pro Entropy Plugin
  – Download from link Ch 13g
  – Use StandAlone version
  – Double-click region, then Calculate, Draw
  – Lighter regions have high entropy
  – Hover over graph to see numerical value
Custom Encoding
 Homegrown Encoding Schemes
• Examples
  – One round of XOR, then Base64
  – Custom algorithm, possibly similar to a
    published cryptographic algorithm
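The first example, one round of XOR followed by Base64, can be sketched as below. The key value and function names are hypothetical. Decoding simply reverses the layers: Base64-decode first, then XOR with the same key.

```python
import base64

def encode(data: bytes, key: int) -> str:
    """XOR each byte with `key`, then Base64-encode the result."""
    xored = bytes(b ^ key for b in data)
    return base64.b64encode(xored).decode("ascii")

def decode(text: str, key: int) -> bytes:
    """Undo the layers in reverse order: Base64-decode, then XOR."""
    return bytes(b ^ key for b in base64.b64decode(text))

token = encode(b"HI", 0x3C)
print(token)                  # dHU=
print(decode(token, 0x3C))    # b'HI'
```

A plain Base64 decode of such a token gives unreadable bytes, which is a hint that a second layer is present.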
  Identifying Custom Encoding
• This sample makes a bunch of 700 KB files
• Figure out the encoding from the code
• Find CreateFileA and WriteFile
  – In function sub_4011A9
• Uses XOR with a pseudorandom stream
Advantages of Custom Encoding to the
              Attacker
• Can be small and nonobvious
• Harder to reverse-engineer
Decoding
             Two Methods
• Reprogram the functions
• Use the functions in the malware itself
            Self-Decoding
• Stop the malware in a debugger with data
  decoded
• Isolate the decryption function and set a
  breakpoint directly after it
• BUT sometimes you can't figure out how
  to stop it with the data you need decoded
   Manual Programming of Decoding
              Functions
• Standard functions may be available
          PyCrypto Library
• Good for standard algorithms
How to Decrypt Using Malware