ISI 2018 Paper 15
   Abstract: Cyber-attacks cost the global economy over $450 billion annually. To combat this issue, researchers and practitioners have put enormous effort into developing Cyber Threat Intelligence (CTI), or the process of identifying emerging threats and key hackers. However, the reliance on internal network data has resulted in inherently reactive intelligence. CTI experts have urged the importance of proactively studying the large, ever-evolving online hacker community. Despite their CTI value, collecting data from hacker community platforms is a non-trivial task. In this paper, we summarize our efforts in systematically identifying and automatically collecting a large-scale set of hacker forums, carding shops, Internet-Relay-Chat, and DarkNet Marketplaces. We also present our efforts to provide this data to the larger CTI community via the AZSecure Hacker Assets Portal (www.azsecure-hap.com). With our methodology, we collected 102 platforms for a total of 43,902,913 records. To the best of our knowledge, this compilation of hacker community data is the largest such collection in academia and can enable numerous novel and valuable proactive CTI research inquiries.

   Keywords: Cybersecurity; Hacker community data collection; Hacker forums; Internet-Relay-Chat; DarkNet Marketplaces; Carding Shops; cyber threat intelligence

                       I. INTRODUCTION

   With computer and information technology becoming more ubiquitous, cybersecurity has become a grand societal challenge. Today, malicious hackers commit numerous large-scale, advanced attacks on industry and government organizations. These cyber-attacks cost the global economy over $450 billion annually [1]. Cyber Threat Intelligence (CTI), or the process of identifying emerging threats and key threat actors (i.e., hackers) to enable effective cybersecurity decisions, has emerged as a viable approach to mitigate this concern.

   Fundamentally a data-driven process, CTI has traditionally focused on collecting data from internal network devices (databases, IDS/IPS, routers, workstations, and others). Collected data is analyzed using malware analysis, forensics, event correlation, and other well-established methods. Despite the prevalence of these approaches, CTI experts from major cybersecurity firms, such as the SANS Institute, note that the reliance on data from past events (i.e., internal network data) results in inherently reactive intelligence [2]. As a result, cyber-attacks remain on an unfortunate uptick.

   To combat this issue, CTI experts have urged the importance of developing proactive CTI by directly investigating hackers within the online hacker community [2][3]. The international online hacker community attracts and motivates millions of hackers from the US, Russia, and China to share or sell hacking tools, knowledge, and other illegal products and services [4]. Today, four major hacker community platforms exist: hacker forums, Internet-Relay-Chat (IRC), carding shops, and DarkNet Marketplaces (DNMs). Exploits found on these platforms have been used in well-publicized attacks, such as the BlackPOS malware in the Target breach and the Mirai botnet in the internet-scale Denial-of-Service (DoS) attack of 2016.

   Despite each platform’s significant CTI value, collecting their data is a non-trivial task. Hacker community platforms carefully conceal themselves and employ numerous anti-crawling measures that prevent automated, large-scale data collection. These barriers force many researchers into manual collection efforts. Studies attempting automated collection are limited to one platform type. In this paper, we summarize our work in identifying and automatically collecting a large-scale set of hacker forum, carding shop, IRC, and DNM data. We also present our efforts to provide this data to the larger CTI community via the AZSecure Hacker Assets Portal. To the best of our knowledge, this collection of hacker community data is the largest in academia. Consequently, it can enable a multitude of novel and valuable proactive CTI research inquiries.

   The remainder of this paper is organized as follows. Section II reviews each platform, discusses its CTI value, and notes existing data collection strategies. Section III details our platform identification and collection methodology. Section IV summarizes our collected data and highlights promising CTI research directions. Section V illustrates key functions of the Hacker Assets Portal. Section VI concludes this work.

                 II. HACKER COMMUNITY PLATFORM REVIEW

A. Hacker Community Platforms Overview
   To the best of our knowledge, four hacker community platforms exist: (1) forums, (2) IRC, (3) DNMs, and (4) carding shops. Taken together, these platforms have hundreds of millions of records made by millions of hackers across the
globe. Table I describes each platform and its CTI value.

     TABLE I. HACKER COMMUNITY PLATFORM SUMMARY

  Platform        Description                        CTI Value
  Hacker Forums   Message board allowing members     Key threat actor identification; sharing
                  to post messages that are          of hacking tools; indication of access
                  archived                           to other hacker communities
  IRC             Plain-text, instant messaging      Sharing of hacking knowledge and
                  communication that is not          potential targets; indication of access
                  archived                           to other hacker communities
  Carding Shops   Shops selling stolen credit/debit  Monitoring trafficking in the internet
                  cards and sensitive data           fraud industry; early warning of
                                                     breaches before they happen

   The four hacker community platforms create an ecosystem of hacker activities. Hackers use forums and/or IRC to freely discuss and share Tools, Techniques, and Processes (TTP) and to advertise hacking services. Hackers freely download these tools or navigate to DNMs to purchase exploits. These tools help hackers conduct cyber-attacks to obtain sensitive data such as credit/debit cards, Social Security Numbers (SSN), and logins, which they sell on DNMs and/or carding shops for financial gain. Each platform is further detailed in the following sub-sections.

 Figure 1. An example of a hacker forum member sharing ransomware code

2) Internet-Relay-Chat (IRC)
   Built on a separate protocol, an IRC server can hold multiple channels, each containing conversations about pre-defined topics [4]. Although not originally intended for hackers, IRC channels have become a popular medium for hacktivist groups to share hacking knowledge. Figures 2 and 3 illustrate two examples of user behavior on hacker IRC. Figure 2 depicts hackers sharing links to forums with illegal content. Figure 3 illustrates an IRC user requesting a hacking service against a provided target IP.

 Figure 2. An example of hackers sharing links containing illegal contents

 Figure 3. An example of an IRC user demanding hacker service

   Unlike forums, IRC conversations are not archived and must be collected in real-time. Additionally, IRC messages are broadcast to all channel participants. If a user loses the server connection, they cannot retrieve the conversation for that time period [6]. This allows hackers to share hacking knowledge and targets more freely. As a result, collecting IRC data can help researchers understand hacker behaviors, targets, and emerging threats.

 Figure 4. An example of a product listing page on DNM

   Data and information stolen from breached companies are
often sold on DNMs. Thus, DNMs can serve as an early indicator for breached organizations. Also, past research indicates that the growth of DNMs has raised public health and law enforcement concerns because DNMs hold an abundance of drug listings [8][9]. Researchers observed that DNM users often share information about reliable sellers and quality goods [9][10]. Moreover, DNM data has been used to explore product distribution networks [11]. While Tor’s untraceable nature makes linking DNM users to their true identities a non-trivial task, studying their behavior can provide an in-depth understanding of the underground economy.

4) Carding Shops
   Carding shops facilitate many underground economy activities by providing high quality carding services [4]. A large amount of stolen card data obtained from data breaches is traded and diffused through carding shops [12]. Similar to DNMs, access to carding shops is often advertised on other hacker community platforms. Since card data can be duplicated and reused, the rapid dissemination of card information can inflict significant financial losses on cardholders.

   Carding shop data can be divided into payment card data, identity data, and credential data [13]. Payment card data is the major product type in carding shops, and it can be further categorized into “Dumps” and “Fullz” based on the amount of information carried by the product. Dumps refer to the raw information retrieved from the magnetic stripe of the card. Fullz include the full information of the victim, such as name and address. Both Dumps and Fullz contain three sections: card information, source, and price. Beyond these two, SSNs and logins are also commonly found on carding shops.

   Carding shops have unique CTI value as they provide a comprehensive view of carding fraud. Despite their importance, little academic literature exists about them [4]. Past research has compared card attributes against price and found that card data is packaged, labeled, and released periodically on carding shops [12]. Despite this useful discovery, significant potential remains for carding shop data to be used as a source for identifying exploited individuals and financial institutions.

B. Hacker Community Data Collection: Existing Approaches and Challenges
   Unlike with traditional social media sites, researchers face numerous issues when collecting data from hacker community platforms. Web-based platforms (i.e., forums, shops, DNMs) employ anti-crawling measures such as drive-by malware, session timers, user-agent checks, CAPTCHA, and others. IRC data must be collected in real-time. These challenges have limited many researchers to manual collection efforts, or to studying old datasets (e.g., dumped SilkRoad, archived forums). While still valuable, such procedures result in incomplete and/or dated CTI insights. Studies employing automated collection procedures are often limited to one platform type (e.g., forums). However, each hacker community platform is interconnected with others. Thus, the focus on collecting one platform prevents comprehensive CTI development. These issues motivate the development of large-scale, automated crawling approaches.

                    III. COLLECTION METHODOLOGY

   We developed a systematic approach to gather a comprehensive collection of hacker community data. The process has three phases: platform identification, enhanced automated collection, and content parsing. We summarize each in the following sub-sections.

A. Hacker Community Platform Identification
   The first stage in any data collection task is identifying the appropriate data sources. We use three approaches to identify hacker platforms: suggestions from cybersecurity experts, surface web and Tor search engines, and snowball identification. Using all three ensures comprehensive, high-quality coverage. Irrespective of approach, we only collected platforms containing significant amounts of cybersecurity content. We deliberately avoided platforms specializing in weapons or pornography, as such content has minimal CTI value.

   In the first strategy, our team consulted with the National Cyber-Forensics Training Alliance (NCFTA), a major non-profit organization focusing on CTI sharing across private, public, and academic sectors, and Policing in Cyberspace (POLCYB), an internationally recognized law enforcement entity. Both suggested platforms providing valuable cybersecurity data and also recommended keywords as input for surface web and Tor-based (e.g., Grams, Deep Dot Web) search engines to identify additional platforms. Since hackers often share information on traditional social media platforms (e.g., Twitter, Facebook, YouTube), these keywords were also entered into these sites to identify additional platforms. Figure 5 depicts a YouTube video of Anonymous recruiting for their IRC channel, #OpTestet.

 Figure 5. An example of a recruiting video of Anonymous on YouTube

   The platforms identified from the first two approaches were used as “seeds” for our final strategy: snowball identification. Hackers within these platforms often post links to other platforms. We followed these links and determined whether they contained valuable cybersecurity content. The newly identified platforms were used as new seeds to identify additional platforms.
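   The snowball identification strategy can be illustrated as a breadth-first traversal over platform links. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the names snowball_identify, fetch_links, and is_relevant, the max_platforms cap, and the toy link graph are all hypothetical. Here fetch_links stands in for extracting outbound platform links from a seed, and is_relevant stands in for assessing whether a linked platform contains valuable cybersecurity content.

```python
from collections import deque
from typing import Callable, Iterable


def snowball_identify(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_platforms: int = 1000,  # hypothetical safety cap, not from the paper
) -> list[str]:
    """Expand a set of seed platforms by following links breadth-first."""
    identified: list[str] = []   # platforms accepted so far, in discovery order
    seen: set[str] = set()       # every platform already examined
    queue: deque[str] = deque()  # platforms whose links still need following

    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            identified.append(seed)
            queue.append(seed)

    while queue and len(identified) < max_platforms:
        current = queue.popleft()
        for link in fetch_links(current):
            if link in seen:
                continue
            seen.add(link)
            # Only relevant platforms become new seeds.
            if is_relevant(link):
                identified.append(link)
                queue.append(link)
                if len(identified) >= max_platforms:
                    break
    return identified


# Toy link graph (hypothetical data) standing in for links posted on platforms.
TOY_LINKS = {
    "forum-a": ["irc-b", "weapons-shop"],
    "irc-b": ["dnm-c", "forum-a"],
    "dnm-c": [],
}

found = snowball_identify(
    seeds=["forum-a"],
    fetch_links=lambda platform: TOY_LINKS.get(platform, []),
    is_relevant=lambda platform: "weapons" not in platform,
)
print(found)
```

   In this toy run, the weapons-focused platform is examined but never becomes a seed, mirroring the exclusion of platforms with minimal CTI value. Tracking every examined platform in the seen set keeps mutual links between platforms from causing repeated visits.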
B. Enhanced Automated Collection Procedures
   Collection processes for hacker community data vary by platform. Past research on web-based platforms (i.e., forums, DNMs, carding shops) usually used undirected web crawlers to collect raw data in HTML format. To address the anti-crawling challenges detailed in our review, we upgraded our crawler to directly collect web pages and their contents. Moreover, we used packet capture technology to handle websites that prevent users from skipping specific pages. By flexibly switching between types of HTTP request, we significantly reduced the time needed to crawl web-based hacker platform data. For IRC data, we placed two “bots,” similar to fake users, inside each channel and had each bot log in on its own schedule to avoid automatic disconnections.

C. Content Parsing
   After data collection, the raw data requires further parsing to enable subsequent analytics. For forums, DNMs, carding shops, and IRC, parsing entails recognizing text patterns containing relevant attributes (e.g., post date, product description). We developed several custom parser programs and leveraged techniques such as Regular Expression (RegEx) to retrieve information from those platforms, and stored the results in a relational database.

                IV. DATA COLLECTION OVERVIEW

   In our hacker community data collection, we successfully collected 51 hacker forums, 13 IRC channels, 12 DNMs, and 26 carding shops. Table II summarizes our collection.

 TABLE II. HACKER COMMUNITY DATA COLLECTION SUMMARY

  Platform        # of Platforms   # of Records                      Languages
  Forums          51 forums        3,643,216 posts                   English/Russian/Arabic
  IRC             13 channels      2,791,120 lines of conversation   English
  DNM             12 markets       291,616 listings                  English/Russian/French
  Carding Shops   26 shops         8,674,087 listings                English

A. Hacker Forums
   We focused on collecting 51 hacking-oriented forums containing 32,266,852 posts in 2,961,363 threads. Of these, 25,939,871 are in English, 5,975,821 are in Russian, and 2,624,658 are in Arabic. Generally, we observed a high frequency of hacking/security tools, for instance, online shopping site receipt generators for phishing purposes. Some forums specialize in other services such as breached data, mobile malware, cryptocurrencies, login dumps, and code for AI bots. The multilingual nature of our collection can facilitate cross-country comparisons in cybersecurity research.

   Due to their popularity and ease of access, hacker forums offer a unique advantage that might not be seen in other hacker community platforms. The prolific nature of forums, as well as their dynamic and time-sensitive properties, enables researchers to identify trends in cyber threats more easily and sometimes earlier than on other platforms. Hence, a promising research direction would be developing time-sensitive methods to analyze trends in the cyber threat landscape through constant monitoring of forum data. Another direction would be cross-referencing forum data with DNMs to obtain a holistic trend analysis. Moreover, due to the interactive structure of these platforms, they are capable of revealing the interaction network of cyber criminals.

B. IRC
   We collected 2,791,120 lines of conversation from 13 cybersecurity-specific IRC channels between 9/2016 and 1/2018. The data collection is summarized in Table III. For space considerations, we list only the top six channels.

        TABLE III. IRC DATA COLLECTION SUMMARY

  Channel                   # of lines   Description
  #anonops                  1,696,024    General discussion of hacking-related topics
  #ed                       574,024      Discussion about current topics
  #hackers                  174,328      General discussion of tips and tricks for Anonymous hackers
  #Evilzone                 163,402      Casual discussions on cyber security
  #ddos                     23,172       Posts about current ddos tools recommended by Anonymous hackers
  #tutorials                77,903       Offers selected members tutorials through a separate interactive IRC channel
  Total (of all channels)   2,791,120    -

   The most popular IRC channel, “anonops,” is the main channel of the famous hacktivist group Anonymous. Anonymous also runs channels such as “ddos,” which focuses on Distributed Denial-of-Service (DDoS), and “hackers,” in which users share and ask for hacking tips. IRC users also request and offer hacking services, exchanging target information with each other. While past studies have explored IRC participant duration [14], the CTI value of IRC data remains largely unexplored. The links, URLs, and named entities exchanged in IRC chatrooms can be used in a snowball sampling manner to expand cyber threat resources. After identifying resources, a promising research direction would be discovering conversation topics via topic modeling approaches (e.g., Latent Dirichlet Allocation). Techniques such as Named Entity Recognition (NER) and relationship extraction can identify the targets of hackers and hacktivists in IRC.

C. DNM
   We collected 12 DNMs between September 2016 and January 2018. Table IV summarizes our DNM data collection.

        TABLE IV. DNM DATA COLLECTION SUMMARY

  DNM               # of listings   # of security listings   Language
  0day              28,330          28,330+                  English
  Alphabay          25,118          N/A                      English
  Apple Market      2,012           N/A                      English
  Dream Market      120,962         1,916+                   English
  French Deep Web   1,536           134+                     French
interested in identifying whether their credit/debit card was stolen can search for their name. Should it appear, they can pinpoint which shop it is sold in, the card price, and other details.

   Beyond these core functionalities, we developed visualizations for users to interactively explore the data. Carefully constructed based on internal feedback provided by SFS students and external suggestions from NCFTA and POLCYB, each visualization can be filtered and sorted based on a user’s needs. Figure 7 provides one panel of our carding shop dashboard, which gives an overview of active and expired cards as well as a geo-spatial breakdown of their locations and associated frequencies.

 Figure 7. (a) scorecard of active and expired cards, (b) locations, (c) search, sort, and filter functions, and (d) frequency of cards based on zip code

   Users can get more in-depth intelligence on selected platforms with finer grained visualizations. Figure 8 provides an example of a carding shop dashboard allowing users to dynamically compare shops based on frequency of cards, average prices, and banks of stolen cards.

 Figure 8. (a) frequency of cards per shop, (b) banks of stolen cards, (c) average card prices, (d) filter capabilities, and (e) card issuers with most stolen cards

            VI. CONCLUSION AND FUTURE DIRECTIONS

   Cybersecurity has become a societal concern. CTI can provide a valuable process to pinpoint and defend against cyber-attacks. Today, the online hacker community has emerged as a valuable data source to generate proactive CTI. In this paper, we summarized our efforts to systematically identify and collect four major hacker community platforms: hacker forums, IRC, carding shops, and DarkNet Marketplaces. Through our methodology, we collected 102 platforms for a total of 43,902,913 records. This large-scale collection enables numerous novel CTI research possibilities, including cross-platform analysis, multi-lingual analysis, and others. Each direction can significantly improve our understanding of the hacker community, create proactive CTI, and most importantly, help develop a more secure society.

                     ACKNOWLEDGEMENTS

   This work was supported in part by the National Science Foundation (NSF) DUE-1303362 (SFS), SES-1314631 (SaTC), ACI-1443019 (DIBBs), and 1719477 (EAGER).

                        REFERENCES

[1]  Graham, L. (2017). “Cybercrime costs the global economy $450 billion: CEO.” CNBC, 7 Feb. 2017. https://www.cnbc.com/2017/02/07/cybercrime-costs-the-global-economy-450-billion-ceo.html
[2]  Bromiley, M. (2016). “Threat Intelligence: What it is, and how to use it effectively.” SANS Institute. https://www.sans.org/reading-room/whitepapers/analyst/threat-intelligence-is-effectively-37282
[3]  Shackleford, D. (2016). “Security Analytics Survey.” SANS Institute. https://www.sans.org/reading-room/whitepapers/analyst/2016-security-analytics-survey-37467
[4]  Benjamin, V., Li, W., Holt, T., & Chen, H. (2015, May). Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. In Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on (pp. 85-90). IEEE.
[5]  Samtani, S., Chinn, R., Chen, H., & Nunamaker Jr, J. F. (2017). Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems, 34(4), 1023-1053.
[6]  Benjamin, V., Zhang, B., Nunamaker Jr, J. F., & Chen, H. (2016). Examining hacker participation length in cybercriminal Internet-relay-chat communities. Journal of Management Information Systems, 33(2), 482-510.
[7]  Christin, N. (2013, May). Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. In Proceedings of the 22nd international conference on World Wide Web (pp. 213-224). ACM.
[8]  Corazza, O., Schifano, F., Simonato, P., Fergus, S., Assi, S., Stair, J., ... & Blaszko, U. (2012). Phenomenon of new drugs on the Internet: the case of ketamine derivative methoxetamine. Human Psychopharmacology: Clinical and Experimental, 27(2), 145-149.
[9]  Van Hout, M. C., & Bingham, T. (2014). Responsible vendors, intelligent consumers: Silk Road, the online revolution in drug trading. International Journal of Drug Policy, 25(2), 183-189.
[10] Davey, Z., Schifano, F., Corazza, O., Deluca, P., & Psychonaut Web Mapping Group. (2012). e-Psychonauts: conducting research in online drug forum communities. Journal of Mental Health, 21(4), 386-394.
[11] Broséus, J., Rhumorbarbe, D., Mireault, C., Ouellette, V., Crispino, F., & Décary-Hétu, D. (2016). Studying illicit drug trafficking on Darknet markets: structure and organisation from a Canadian perspective. Forensic Science International, 264, 7-14.
[12] Bulakh and M. Gupta (2015). “Characterizing credit card black markets on the web.” DOI: 10.1145/2740908.2742128.
[13] Li, W., Yin, J., & Chen, H. (2016). Identifying high quality carding services in underground economy using nonparametric supervised topic model. International Conference on Information Systems. Dublin, Republic of Ireland.
[14] Benjamin, V., & Chen, H. (2014, September). Time-to-event modeling for predicting hacker IRC community participant trajectory. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint (pp. 25-32). IEEE.
[15] Samtani, S., Chinn, K., Larson, C., & Chen, H. (2016). AZSecure Hacker Assets Portal: Cyber threat intelligence and Malware Analysis. In Intelligence and Security Informatics (ISI), 2016 IEEE International Conference on (pp. 19-24). IEEE.