SourceFinder: Finding Malware Source-code from Publicly Available Repositories in GitHub
Abstract
Where can we find malware source code? This question is
motivated by a real need: there is a dearth of malware source
code, which impedes various types of security research. Our
work is driven by the following insight: public archives, like
GitHub, have a surprising number of malware repositories.
Capitalizing on this opportunity, we propose SourceFinder,
a supervised-learning approach to identify repositories of
malware source code efficiently. We evaluate and apply our
approach using 97K repositories from GitHub. First, we show
that our approach identifies malware repositories with 89%
precision and 86% recall using a labeled dataset. Second, we
use SourceFinder to identify 7504 malware source code repositories, which arguably constitutes the largest malware source code database. Finally, we study the fundamental properties and trends of the malware repositories and their authors. The number of such repositories appears to be growing by an order of magnitude every 4 years, and 18 malware authors seem to be "professionals" with a well-established online reputation. We argue that our approach and our large repository of malware source code can be a catalyst for research studies, which are currently not possible.

Figure 1: The steps of our work as a funnel: We identify 7.5K malware source code repositories in GitHub starting from 32M repositories based on 137 malware keywords (Q137).

1 Introduction

Security research could greatly benefit from an extensive database of malware source code, which is currently unavailable. This is the assertion that motivates this work. First, security researchers can use malware source code to: (a) understand malware behavior and techniques, and (b) evaluate security methods and tools. In the latter, having the source code can provide the ground truth for assessing the effectiveness of different techniques, such as reverse engineering methods. Second, currently, a malware source code database is not readily available. By contrast, there are several databases with malware binary code, as collected via honeypots, but even those are often limited in number and not widely available. We discuss existing malware archives in Section 9.

A missed opportunity: Surprisingly, software archives, like GitHub, host many publicly-accessible malware repositories, but this has not yet been explored to provide security researchers with malware source code. In this work, we focus on GitHub, which is arguably the largest software storing and sharing platform. As of October 2019, GitHub reports more than 34 million users [25] and more than 32 million public repositories [24]. As we will see later, there are thousands of repositories that have malware source code, which seem to have escaped the radar of the research community so far. We use a broad definition of malware to include any repository containing software that can participate in compromising devices and supporting offensive, undesirable and parasitic activities.

Why do authors create public malware repositories? This question mystified us: these repositories expose both the creators and the intelligence behind the malware. Intrigued, we conducted a small investigation on malware authors, as we discuss below.

Problem: How can we find malware source code repositories in a large archive, like GitHub? The input to the problem is an online archive and the desired output is a database of
USENIX Association 23rd International Symposium on Research in Attacks, Intrusions and Defenses 149
malware repositories. The challenges include: (a) collecting an appropriate set of repositories from the potentially vast archive, and (b) identifying the repositories that contain malware. Optionally, we also want to further help researchers that will potentially use these repositories, by determining additional properties, such as the most likely target platform, the malware type or family, etc. Another practical challenge is the need to create the ground truth for validation purposes.

Related work: To the best of our knowledge, there does not seem to be any study focusing on the problem above. We group related works in the following categories. First, several studies analyze software repositories to find usage and limitations without any focus on malware [14]. Second, several efforts create and maintain databases of malware binaries but without source code [2, 3]. Third, many efforts attempt to extract higher-level information from binaries, such as lifting to Intermediate Representation (IR) [20], but it is really difficult to re-create the source code [10]. In fact, such studies would benefit from our malware source-code archive to evaluate and improve their methods. Taking a software engineering angle, an interesting work [8] compares the evolution of 150 malware source code repositories with that of benign software. We discuss related works in Section 9.

Contributions: Our work is arguably the first effort to systematically identify malware source code repositories from a massive public archive. The contribution of this work is three-fold: (a) we propose SourceFinder, a systematic approach to identify malware source-code repositories with high precision, (b) we create, arguably, the largest non-commercial malware source code archive with 7504 repositories, and (c) we study patterns and trends of the repository ecosystem including temporal and author-centric properties and behaviors. We apply and evaluate our method on the GitHub archive, though it could also be used on other archives, as we discuss in Section 8.

Our key results can be summarized in the following points, and some key numbers are shown in Figure 1.

a. We collect 97K malware-related repositories from GitHub, namely repositories retrieved using malware keywords through GitHub's API and employing techniques to overcome several limitations. We also generate an extensive groundtruth with 2013 repositories, as we explain in Section 3.

b. SourceFinder achieves 89% precision. We systematically consider different Machine Learning approaches, and carefully-created representations for the different fields of the repository, such as title, description, etc. We then systematically evaluate the effect of the different features, as we discuss in Section 5. We show that we classify malware repositories with 89% precision, 86% recall and 87% F1-score using five fields from the repository.

c. We identify 7504 malware source-code repositories, which is arguably the largest malware source-code database available to the research community. We have already downloaded the contents of these repositories, in case GitHub decides to deactivate them. We also create a curated database of 250 malware repositories, manually verified and spanning a wide range of malware types.

d. The number of new malware repositories in our data more than triples every four years. The increasing trend is interesting and alarming at the same time.

e. We identify popular and influential repositories. We study the malware repositories using three metrics of popularity: the number of watchers, forks and stars. We find 8 repositories that dominate the top-5 lists for all three metrics.

f. We identify prolific and influential authors. We find that 3% of the authors have more than 300 followers. We also find that 0.2% of the authors have more than 7 malware repositories, with the most prolific author, cyberthreats, having created 336 repositories.

g. We identify and profile 18 professional hackers. We find 18 authors of malware repositories who seem to have created a brand around their activities, as they use the same user names in security forums. For example, user 3vilp4wn (pronounced evil-pawn) is the author of a keylogger malware in GitHub, which the author is promoting in the Hack This Site forum using the same username. We present our study of malware authors in Section 7.

Open-sourcing for maximal impact: creating an engaged community. We intend to make our datasets and our tools available for research purposes at our website [28]. Our vision is to create a community-driven reference platform, which will provide: (a) malware source code repositories, (b) community-vetted labels and feedback, and (c) open-source tools for collecting and analyzing malware repositories. Our goal is to expand our database with more software archives and richer information. Although authors could start hiding their repositories (see Section 8), we argue that our already-retrieved database could have significant impact in enabling certain types of security studies [22, 29, 32].

2 Background

We provide background information on GitHub and the type of information that repositories have.

GitHub is a massive world-wide software archive, which enables users to share code through its public repositories, thus creating a global social network of interaction. For instance, first, users can collaborate on a repository. Second, users often "fork" projects: they copy and evolve projects. Third, users can follow projects, and "up-vote" projects using "stars" (think Facebook likes). Although GitHub has many private repositories, there are 32 million public software repositories.

We describe the key elements of a GitHub repository. A repository is equivalent to a project folder, and typically, each repository corresponds to a single software project. However, a repository could contain: (a) source code, (b) binary code, (c) data, (d) documents, such as latex files, and (e) all of the above.

A repository in GitHub has the following data fields: a) title,
b) description, c) topics, d) README file, e) files and folders, f) date of creation and last modification, g) forks, h) watchers, i) stars, and j) followers and followings, which we explain below.

a. Repository title: The title is a mandatory field and it usually consists of fewer than 3 words.

b. Repository description: This is an optional field that describes the objective of the project and is usually 1-2 sentences long.

c. Repository topics: An author can optionally provide topics for her repository, in the form of tags, for example, "linux, malware, malware-analysis, anti-virus". Note that 97% of the repositories in our dataset have fewer than 8 topics.

d. README file: As expected, the README file is a documentation and/or light manual for the repository. This field is optional and its size varies from one or two sentences to many paragraphs. For example, we found that 17.48% of the README files in our repositories are empty.

e. Files and folders: In well-constructed software, the file and folder names of the source code can provide useful information. For example, some malware repositories contain files or folders with indicative names, such as "malware", "source code", or even specific malware types or names of specific malware, like mirai.

f. Date of creation and last modification: GitHub maintains the date of creation and last modification of a repository. We find malware repositories created in 2008 that are still actively being modified by their authors.

g. Number of forks: Users can fork a public repository: they can create a clone of the project. A user can fork any public repository to change it locally and contribute to the original project if the owner accepts the modification. The number of forks is an indication of the popularity and impact of a repository. Note that the number of forks indicates the number of distinct users that have forked a repository.

h. Number of watchers: Watching a repository is equivalent to "following" in social media language. A "watcher" gets notifications if there is any new activity in that project. The number of watchers is an indication of the popularity of a repository [16].

i. Number of stars: A user can "star" a repository, which is equivalent to the "like" function in social media [5], and places the repository in the user's favorites group, but does not provide constant updates as with the "watching" function.

j. Followers: Users can also follow other users' work. If A follows B, A will be added to B's followers and B will be added to A's following list. The number of followers is an indication of the popularity of a user [39].

3 Data Collection

The first step in our work is to collect repositories from GitHub that have a higher chance of being related to malware. Extracting repositories at scale from GitHub hides several subtleties and challenges, which we discuss below.

Using the GitHub Search API, a user can query with a set of keywords and obtain the most relevant repositories. We describe briefly how we select appropriate keywords, how we retrieve related repositories from GitHub, and how we establish our ground truth.

A. Selecting keywords for querying: In this step, we want to retrieve repositories from GitHub in a way that: (a) provides as many malware repositories as possible, and (b) provides wide coverage over different types of malware. For this reason, we select keywords from three categories: (a) malware and security related keywords, such as malware and virus, (b) malware type names, such as ransomware and keylogger, and (c) popular malware names, such as mirai. Due to space limitations, we will provide the full list of keywords on our website at publication time for repeatability purposes.

We define three sets of keywords that we use to query GitHub. The reason is that we want to assess the sensitivity of the outcome to the number of keywords. Specifically, we use the following query sets: (a) the Q1 set, which only contains the keyword "malware"; (b) the Q50 set, which contains 50 keywords; and (c) the Q137 set, which contains 137 keywords. The Q137 keyword set is a super-set of Q50, and Q50 is a super-set of Q1. As we will see below, using the query set Q137 provides wider coverage, and we recommend it in practice. We use the other two to assess the sensitivity of the results to the initial set of keywords. We list our datasets in Table 1.

Set   | Description                                   | Size
Q1    | Query set = {"malware"}                       | 1
Q50   | Query with 50 keywords, with Q1 ⊂ Q50         | 50
Q137  | Query with 137 keywords, with Q50 ⊂ Q137      | 137
RD1   | Retrieved repositories from query Q1          | 2775
RD50  | Retrieved repositories from query Q50         | 14332
RD137 | Retrieved repositories from query Q137        | 97375
LD1   | Labeled subset of RD1 dataset                 | 379
LD50  | Labeled subset of RD50 dataset                | 755
LD137 | Labeled subset of RD137 dataset               | 879
M1    | Malware source code repositories in RD1       | 680
M50   | Malware source code repositories in RD50      | 3096
M137  | Malware source code repositories in RD137     | 7504
MCur  | Manually verified malware source code dataset | 250

Table 1: Datasets, their relationships, and their sizes.

B. Retrieving related repositories: Using the Search API, we query GitHub with our set of keywords. Specifically, we query GitHub with every keyword in our set separately. In an ideal world, this would have been enough to collect all related repositories: a query with "malware" (Q1) should return the many thousands of related repositories, but this is not the case.

The search capability hides several subtleties and limitations. First, there is a limit of 1000 repositories that a single
Labeled Dataset | Malware Repo. | Benign Repo.
LD137           | 313           | 566
LD50            | 326           | 429
LD1             | 186           | 193

Table 2: Our groundtruth: labeled datasets for each of the three queries, for a total of 2013 repositories.

search can return: we get the top 1000 repositories ordered by relevancy to the query. Second, the GitHub API allows 30 requests per minute for an authenticated user and 10 requests per minute for an unauthenticated user.

Bypassing the API limitations. We were able to find a workaround for the first limitation by using the ranking option. Namely, a user can specify her preferred ranking order for the results based on: (a) best match, (b) most stars, (c) fewest stars, (d) most forks, (e) fewest forks, (f) most recently updated, and (g) least recently updated. By repeating a query with all seven ranking options, we can maximize the number of distinct repositories that we get. This way, for each keyword in our set, we search with these seven different ranking preferences to obtain a list of GitHub repositories.

C. Collecting the repositories: We download all the repositories identified in our queries using PyGithub [52], and we obtain three sets of repositories: RD1, RD50 and RD137. These retrieved datasets have the same "subset" relationship that the query sets have: RD1 ⊂ RD50 ⊂ RD137. Note that we remove pathological repositories, mainly repositories with no actual content, or repositories "deleted" by GitHub. For each repository, we collect and store: (a) repository-specific information, (b) author-specific information, and (c) all the code within the repository.

As we see from Table 1, using more and specialized malware keywords returns significantly more repositories. Namely, searching with the keyword "malware" returns 2775 repositories, but searching with Q50 and Q137 returns 14332 and 97375 repositories respectively.

D. Establishing the groundtruth: As there was no available groundtruth, we needed to establish our own. As this is a fairly technical task, we opted for domain experts instead of Mechanical Turk users, as recommended by recent studies [23]. We use three computer scientists to manually label 1000 repositories, selected in a uniformly random fashion, from each of our datasets RD137 and RD50, and 600 repositories from RD1. The judges were instructed to independently investigate every repository thoroughly.

Ensuring the quality of the groundtruth. To increase the reliability of our groundtruth, we took the following measures. First, we asked judges to label a repository only if they were certain that it is malicious or benign and distinct, and to leave it unlabeled otherwise. We only kept the repositories for which the judges agreed unanimously. Second, duplicate repositories were removed via manual inspection and excluded from the final labeled dataset to avoid overfitting. It is worth noting that we only found very few duplicates, in the order of 3-5 in each dataset with hundreds of repositories.

With this process, we establish three separate labeled datasets named LD137, LD50, and LD1, starting from the respective malware repositories from each of our queries, as shown in Table 2. Although the labeled datasets are not 50-50, they represent both classes reasonably well, so that a naive solution that labels everything as one class would perform poorly. By contrast, our approach performs sufficiently well, as we will see in Section 5. As there is no available dataset, we argue that we make a dataset of sufficient size by manual effort.

4 Overview of our Identification Approach

Here, we describe our supervised learning algorithm to identify the repositories that contain malware.

Step 1. Data preprocessing: As in any Natural Language Processing (NLP) method, we start with some initial processing of the text to improve the effectiveness of the solution. We briefly outline three levels of processing functionality.

a. Character level preprocessing: We handle character level "noise" by removing special characters, such as punctuation and currency symbols, and fix Unicode and other encoding issues.

b. Word level preprocessing: We eliminate or aggregate words following the best practices of Natural Language Processing [33]. First, we remove article words and other words that don't carry significant meaning on their own. Second, we use a stemming technique to handle inflected words. Namely, we want to decrease the dimensionality of the data by grouping words with the same "root". For example, we group the words "organizing", "organized", "organize" and "organizes" into one word, "organize". Third, we filter out common file and folder names that we do not expect to help in our classification, such as "LEGAL", "LICENSE", "gitattributes" etc.

c. Entity level filtering: We filter entities that are likely not helpful in describing the scope of a repository. Specifically, we remove numbers, URLs, and emails, which are often found in the text. We found that this filtering improved the classification performance. In the future, we could consider mining URLs and other information, such as names of people, companies or youtube channels, to identify authors, verify intention, and find more malware activities.

Step 2. The repository fields: We consider fields from the repositories that can be numbers or text. Text-based fields require processing in order to turn them into classification features, which we explain below. We use and evaluate the following text fields: title, description, topics, file and folder names, and the README file.

Text field representation: We consider two techniques to represent each text field by a feature in the classification.

a. Bag of Words (BoW): The bag-of-words (BoW) model is among the most widely used representations of a document. The document is represented as the number of occurrences of
its words, disregarding grammar and word order [75]. This model is commonly used in document classification, where the frequency of each word is used as a feature value for training a classifier [42]. We use the model with the count vectorizer and TF-IDF vectorizer to create the feature vector.

In more detail, we represent each text field in the repository with a vector V[K], where V[i] corresponds to the significance of word i for the text. There are several ways to assign the values V[i]: (a) zero-one to account for presence, (b) number of occurrences, and (c) the TF-IDF value of the word. We evaluated all the above methods.

Fixing the number of words per field. To improve the effectiveness of our approach using BoW, we conduct a feature selection process using the χ2 statistic, following best practices [55]. The χ2 statistic measures the lack of independence between a word (feature) and a class. A feature with a lower chi-square score is less informative for that class, and thus not useful in the classification. We discuss this further in Section 5. For each text-based field f, we select the top K_f words for that field, which exhibit the highest discerning power in identifying malware repositories. Note that we set the value of K_f during the training stage; for each field, we select the value K_f as we explain in Section 5.

b. Word embedding: The word embedding model is a vector representation of each word in a document: each word is mapped to an M-dimensional vector of real numbers [44], or equivalently, projected into an M-dimensional space. A good embedding ensures that words that are close in meaning have nearby representations in the embedded space. To create the document vector, word embedding follows two approaches: (i) a frequency-based vectorizer (unsupervised) [58] and (ii) a content-based vectorizer (supervised) [38]. Note that in this type of representation, we do not use the word level processing described in the previous step, since this method can leverage contextual information.

We use frequency-based word embedding with word average and TF-IDF vectorizers. We also use the pre-trained Google word2vec [43] and Stanford GloVe [49] models to create the feature vector.

Finally, we create the vector of the repository by concatenating the vectors of each field of that repository.

Step 3. Selecting the fields: Another key question is which fields from the repository to use in our classification. We experiment with all of the fields listed in Section 2 and explain our findings in the next section.

Step 4. Selecting a ML engine: We design a ML model to classify the repositories into two classes: (i) malware repository and (ii) benign repository. We systematically evaluate many machine learning algorithms [7, 45]: Naive Bayes (NB), Logistic Regression (LR), Decision Tree (CART), Random Forest (RF), K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), and Support Vector Machine (SVM).

Step 5. Detecting source code repositories: In this final step, we want to identify the presence of source code in the repositories. By June 2020, GitHub started labeling repositories that contain source code. Therefore, one can simply filter out all repositories that are not labelled as such.

As our study predates this GitHub feature, we developed a heuristic approach to identify source code repositories independently, which we describe below. Our heuristic exhibits 100% precision as validated by GitHub's classification, as we will see in Section 5.

Our source-code classification heuristic works in two steps. First, we identify files in the repository that contain source code. To do this, we start by examining their file extensions. If the file extension corresponds to one of the known programming languages: Assembly, C, C++, Batch File, Bash Shell Script, Power Shell Script, Java, Python, C#, Objective-C, Pascal, Visual Basic, Matlab, PHP, Javascript, and Go, we label it as a source file. Second, if the number of source files in a repository exceeds the Source Percentage threshold (SourceThresh), we consider that the repository contains source code.

5 Evaluation: Choices and Results

In this section, we evaluate the effectiveness of the classification based on the methodology defined in Section 4. More specifically, our goal here is to answer the following questions:

1. Repository field selection: Which repository fields should we consider in our analysis?
2. Field representation: Which feature representation is better between bag of words (BoW) and word embedding, considering several versions of each?
3. Feature selection: What are the most informative features in identifying malware repositories?
4. ML algorithm selection: Which ML algorithm exhibits the best performance?
5. Classification effectiveness: What is the precision, recall and F1-score of the classification?
6. Identifying malware repositories: How many malware repositories do we find?
7. Identifying malware source code repositories: How many of the malware repositories have source code?

Note that we have a fairly complex task: we want to identify the best fields, representation method and Machine Learning engine, while considering different values for parameters. What complicates matters is that all these selections are interdependent. We present our analysis in sequence, but in reality we followed many trial-and-error and non-linear paths.

1. Selecting repository fields: We evaluated all the repository fields mentioned earlier. In fact, we ran a significant number of experiments with different subsets of the features, not shown here due to space limitations. We find that the title, description, topics, README file, and file and folder names have the most discerning power. We also considered the number of forks, watchers, and stars of the repository and the number
Representation Classification 375, 400, 425, 450 and 475 for the description field. We find
Accuracy
Range
that the top 100 words for file and folder names and top 400
Bag of Words with Count Vectorizer 86%-51% words for description field give the highest accuracy. Note
Bag of Words with Count Vectorizer + Feature 91%-56% that we do this during training and refining the algorithm, and
Selection then we continue to use these words as features in testing.
Bag of Words with TF-IDF vectorizer 82%-63%
Thus, we select the top: (a) 30 words from the title, (b) 10
Word Embedding with Word Average 85%-72%
Word Embedding with TF-IDF 85%-74%
words from the topics, (c) 400 words from the description,
Pretrained Google word2vec Model 76%-64%
(d) 100 words from the file names, and 10 words from the
Pretrained Stanford (Glov) Model 73%-62% README file. This leads to a total of 550 words across
all fields. For reference, we find 9253 unique words in the
Table 3: Selecting the feature representation model: We eval- repository fields of our training dataset. Reducing the focus
uate all the representations across seven machine learning on the top 550 most discerning words per field increases the
approaches and report the range of the overall accuracy. classification accuracy by as much as 20% in some cases.
4. Evaluating and selecting ML algorithms: We find that
Multinomial Naive Bayes exhibits the best F1-score with 87%,
of followers and followings of the author of the repository. striking a good balance between 89% precision and 86%
We found that not only it did not help, but it usually decreased recall for the malware class among other machine learning
the classification accuracy by 2-3%. One possible explanation classifier which we considered. Detecting the benign class,
is that the numbers of forks, stars and followers reflect the we do even better with 92% precision, 94% recall and 93%
popularity rather than the content of a repository. F1-score. By contrast, the F1-score of the other algorithms is
2. Selecting a field representation: The goal is to find, below 79%. Note that KNN, LR and LDA methods provide
which representation approach works better. In Table 3, we higher precision, but with significantly lower recall. Thus, one
show the comparison of the range of classification accuracy could use these algorithms to get higher precision at the cost
across the 7 different ML algorithms that we will also consider of lower total number of repositories.
below. We find that Bag of Words with the count vectorizer We use Multinomial Naive Bayes as our classification en-
representation reaches 86% classification accuracy, with the word embedding approach nearly matching that with 85% accuracy. Note that we fine-tune the selection of words to represent each field in the next step.

Why does the embedding approach not outperform the bag of words? One would have expected the most complex approach, embedding, to win, and by a significant margin. We attribute this to the relatively small text size in most text fields, which also do not provide well-structured sentences (think two to three words for the title, and isolated words for the topics). Furthermore, word co-occurrences do not exist in the topics and file names fields, which is partly what makes embedding approaches work well in large and well-structured documents [26, 41].

In the rest of this paper, we use the bag of words with the count vectorizer to represent our text fields, since it exhibits good performance and is computationally less intensive than the embedding method.

3. Fixing the number of words per field. We want to identify the most discerning words from each text field, which is a standard process in NLP for improving the scalability, efficiency and accuracy of a text classifier [12]. Using the χ2 statistic, we select the top K_f best words from each field. To select the appropriate number of words per field, we follow the process below. We vary K_f = 5, 10, 20, 30, 40 and 50 for title, topic and README file, and we find that the top 30 words in title, 10 words in topic and 10 words in README file exhibit the highest accuracy. Similarly, we try K_f = 80, 90, 100, 110 and 120 for file names and K_f = 300, 325, 350,

gine for the rest of this study. We attempt to explain the superior F1-score of Naive Bayes in our context. The main advantage of Naive Bayes over other algorithms is that it considers the features independently of each other for a given class and can handle a large number of features better. As a result, it is more robust to noisy or unreliable features. It also performs well in domains with many equally important features, where other approaches suffer, especially with small training data, and it is not prone to overfitting [64]. As a result, Naive Bayes is considered a dependable algorithm for text classification and is often used as the benchmark to beat [71].

5. Assessing the effect of the query set: We have made the following choices in the previous steps: (a) 5 text-based fields, (b) bag of words with count vectorization, (c) 550 total words across all the fields, and (d) the Multinomial Naive Bayes. We perform 10-fold cross-validation and report the precision, recall and F1-score in Figure 2 for our three different labeled datasets. We see that the precision stays above 89% for all three datasets, with a recall above 77%.

It is worth noting the relative stability of our approach with respect to the keyword set of the initial query, especially between the LD50 and LD137 datasets. For the LD1 dataset, we observe higher accuracy but significantly lower recall compared to LD137. We attribute this to the single keyword used in selecting the repositories in LD1, which may have led to a more homogeneous group of repositories. Interestingly, LD50 seems to have the lowest recall and F1-score, even though the differences are not large.
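The pipeline described above (count-vectorized bag of words, χ2-based top-K word selection, Multinomial Naive Bayes) can be sketched with scikit-learn. This is our own illustrative code, not the paper's implementation, and the toy documents merely stand in for the concatenated repository text fields:

```python
# Illustrative sketch (assumed details, not the paper's code) of the pipeline:
# bag of words with count vectorization, chi-squared top-K word selection,
# and a Multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "keylogger malware windows keylogger",
    "botnet malware ddos attack",
    "ransomware malware windows encrypt",
    "todo app list manager",
    "weather app forecast widget",
    "notes app markdown editor",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = malware repository, 0 = benign

pipeline = Pipeline([
    ("counts", CountVectorizer()),       # bag of words: raw word counts
    ("select", SelectKBest(chi2, k=6)),  # keep the most discerning words
    ("clf", MultinomialNB()),            # treats kept words independently per class
])
pipeline.fit(docs, labels)

print(pipeline.predict(["windows keylogger malware"])[0])  # -> 1
```

In the paper's setting, K is chosen per field (e.g., 30 for titles) rather than globally as in this toy example.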
154 23rd International Symposium on Research in Attacks, Intrusions and Defenses USENIX Association
[Figure 2: bar chart of precision, recall and F1-score for the three labeled datasets; the plotted values range from 77% to 98%.]

ular malware type. Opting for diversity and coverage, the dataset spans all the identified types: virus, backdoor, botnet, keylogger, worm, ransomware, rootkit, trojan, spyware, spoof, ddos, sniff, spam, and cryptominer. We intend to constantly update and make our labeled malware repositories publicly available [28].
ID  Author       # Star  # Fork  # Watcher  Content of the Repository
1   ytisf        4851    1393    730        80 malware source code and 140 binaries
2   n1nj4sec     4811    1307    440        Pupy RAT
3   Screetsec    3010    1135    380        TheFatRat Backdoor
4   malwaredllc  2515    513     268        Byob botnet
5   RoganDawes   2515    513     268        USB attack platform
6   Visgean      626     599     127        Zeus trojan horse
7   Ramadhan     535     283     22         30 malware samples
8   dana-at-cp   1320    513     125        backdoor-apk backdoor

Table 5: The profile of the top 5 most influential malware repositories across all three metrics, with 8 unique repositories.

stars, forks, and watchers. We present a short profile of these dominant repositories in Table 5. Most of the repositories contain a single malware project, which is an established practice among the authors in GitHub [48, 66]. We find that the repository "theZoo" [72], created by ytisf in 2014, is the most forked, watched, and starred repository, with 1393 forks, 730 watchers and 4851 stars as of October 2019. However, this repository is quite unique: it was created with the intention of being a malware database, with 140 binaries and 80 source code repositories.

Influence metrics are correlated: As one would expect, the influence and popularity metrics are correlated. We use a common correlation metric, the Pearson Correlation Coefficient (r) [6], measured on a scale of [−1, 1]. We calculate the metric for all pairs of our three popularity metrics. We find that all of them exhibit high positive correlation: stars vs. forks (r = 0.92, p < 0.01), forks vs. watchers (r = 0.91, p < 0.01) and watchers vs. stars (r = 0.91, p < 0.01).

B. Malware type and target platform. We wanted to get a feel for what type of malware we have identified. As a first approximation, we use the keywords found in the text fields to relate repositories in M137 with the type of malware and the intended target platform. Our goal is to create the two-dimensional distribution per malware type and target platform shown in Table 6. To create this table, we associate a repository with: (a) the 6 target platforms, and (b) the 13 malware types, based on keywords found in its title, topics, description, file names and README fields.

How well does this heuristic approach work? We provide two different indications of its relative effectiveness. First, the vast majority of the repositories relate to one platform or type of malware: (a) less than 8% relate to more than one platform, and (b) less than 11% relate to more than one type of malware. Second, we manually verify the 250 repositories in our curated data MCur and find a 98% accuracy.

Type        Wind.  Linux  Mac  IoT  Andr.  iOS  Total
Total       1592   1365   380  108  442    131  4018
keylogger   396    209    42   2    27     3    679
backdoor    181    227    37   11   51     4    511
virus       235    131    34   2    51     16   469
botnet      153    154    43   36   64     17   467
trojan      133    70     24   16   67     19   329
spoof       76     115    88   2    20     9    310
rootkit     55     163    13   2    19     3    255
ransomware  117    67     14   1    33     13   245
ddos        71     95     20   10   9      3    208
worm        61     45     18   5    25     18   172
spyware     45     22     6    6    38     16   133
spam        40     29     18   14   23     5    129
sniff       29     38     23   1    15     5    111

Table 6: Distribution of the malware repositories from the M137 dataset based on malware type and target platform. The table covers the repositories that fit the criteria defined in Section 6.

Below, we provide some observations from Table 6.

a. Keyloggers reign supreme. We see that one of the largest categories is the keylogger malware with 679 repositories, which are mostly affiliated with the Windows and Linux platforms. We discuss the emergence of keyloggers below in our temporal analysis.

b. Windows and Linux are the most popular targets. Not surprisingly, we find that the majority of the malware repositories are affiliated with these two platforms: 1592 repositories for Windows, and 1365 for Linux.

c. MacOS-focused repositories: fewer, but they exist. Although MacOS platforms are less commonly targeted, we find a non-trivial number of malware repositories for MacOS. As shown in Figure 4c, there are 380 MacOS malware repositories, which is roughly an order of magnitude less compared to those for Windows and Linux.

C. Temporal analysis. We want to study the evolution and the trends of malware repositories. We plot the number of new malware repositories per year: a) total malware, b) per type of malware, and c) per target platform in Figure 4. We discuss a few interesting temporal behaviors below.

a. The number of new malware repositories more than triples every four years. We see an alarming increase from 117 malware repositories in 2010 to 620 repositories in 2014 and to 2166 repositories in 2018. We also observe a sharp increase of 70% between 2015 and 2016, shown in Figure 4a.

b. Keyloggers started super-linear growth in 2010 and are by far affiliated with the most new repositories per year since 2013, though their growth rate dropped in 2018.

c. Ransomware repositories emerge in 2014 and gain momentum in 2017. Ransomware experienced its highest growth rate in 2017 with 155 new repositories, while that number dropped to 103 in 2018.

d. Malware activity slowed down in 2018 across the board. It seems that 2018 was a slower year for all malware, even when seen by type (Figure 4b) and target platform (Figure 4c).
Figure 4: New malware repositories per year: (a) all malware, (b) per type of malware, and (c) per target platform.
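The growth claims are easy to check against the yearly counts that label the bars of Figure 4a; a quick back-of-the-envelope computation over those counts gives:

```python
# New-repository counts per year, as labeled on the bars of Figure 4a.
per_year = {2009: 53, 2010: 117, 2011: 166, 2012: 245, 2013: 347,
            2014: 620, 2015: 973, 2016: 1648, 2017: 1927, 2018: 2166}

growth_2010_2014 = per_year[2014] / per_year[2010]   # roughly 5.3x
growth_2014_2018 = per_year[2018] / per_year[2014]   # roughly 3.5x
jump_2015_2016 = (per_year[2016] - per_year[2015]) / per_year[2015]  # ~70%

print(f"{growth_2010_2014:.1f}x, {growth_2014_2018:.1f}x, {jump_2015_2016:.0%}")
```

Both four-year windows exceed a tripling, and the 2015-to-2016 jump matches the 70% increase noted in the text.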
ing two indications. First, we find direct connections between the usernames across different platforms. For example, user 3vilp4wn at the "Hack This Site" forum is promoting a keylogger malware by referring to a GitHub repository [1] whose author has the same username. Second, these usernames are fairly uncommon, which increases the likelihood of belonging to the same person. For example, there is a GitHub user with the name fahimmagsi, and someone with the same username is boasting about their hacking successes in the "Ethical Hacker" forum. As we will see below, fahimmagsi seems to have a well-established online reputation.

b. "Googling" usernames reveals significant hacking activities. Given that these GitHub usernames are fairly unique, it was natural to look them up on the web at large. Even a simple Internet search with the usernames reveals significant hacking activities, including hacking websites or social networks, and offering hacking tutorials on YouTube. We investigate the top 40 most prolific malware authors using a web search with a single simple query: "hacked by <username>". We then examine only the first page of search results. Despite all these self-imposed restrictions, we identify three users with substantial hacking-related activities across the Internet. For example, we find a number of news articles about the hacking of a series of websites by GitHub users fahimmagsi and CR4SH [65] [15]. Moreover, we find user n1nj4sec sharing a multi-functional Remote Access Trojan (RAT) named "Pupy", developed by her, which received significant news coverage in security articles back in March of 2019 [46] [54]. We are confident that well-crafted and targeted searches can connect more malware authors with hacking activities and usernames in other online forums.

8 Discussion

We discuss the effectiveness and limitations of SourceFinder.

a. Why is malware publicly available in the first place? Our investigation in Section 7 provides strong indications that malware authors want to actively establish their hacking reputation. It seems that they want to boost their online credibility, which often translates to money. Recent works [18, 51, 57] study the underground markets of malware services and tools: it stands to reason that notorious hackers will attract more clients. At the same time, GitHub acts as a collaboration platform, which can help hackers improve their tools.

b. Do we identify every malware repository in GitHub? Our tool cannot guarantee that it will identify every malware repository in GitHub. First, we can only identify repositories that "want to be found": (a) they must be public, and (b) they must be described with the appropriate text and keywords. Clearly, if the author wants to hide her repository, we won't be able to find it. However, we argue that this defeats the purpose of having a public archive: if secrecy was desired, the code would have been shared through private links and services. Second, our approach is constrained by GitHub querying limitations, which we discussed in Section 3, and the set of 137 keywords that we use. However, we are encouraged by the number and the reasonable diversity of the retrieved repositories we see in Table 6.

c. Are our datasets representative? This is the typical hard question for any measurement or data collection study. First of all, we want to clarify that our goal is to create a large database of malware source code. So, in that regard, we claim that we accomplished our mission. At the same time, we seem to have a fair number of malware samples in each category of interest, as we see in Table 6.

Studying the trends of malware is a distant second goal, which we present with the appropriate caveat. On the one hand, we are limited by GitHub's API operation, as we discussed earlier. On the other hand, we attempt to reduce the biases that are under our control. To ensure some diversity among our malware, we added as many words as we could to our set of 137 malware keywords, which is likely to capture a wide range of malware types. We argue that the fairly wide breadth of malware types in Table 6 is a good indication. Note that our curated dataset MCur with 250 malware repositories is reasonably representative in terms of coverage.

d. What is the overlap among the identified repositories? Note that our database does not include forked repositories, since GitHub does not return forked repositories as answers to a query. Similarly, the breadth of the types of malware shown in Table 6 hints at a reasonable diversity. However, our tool cannot claim that the identified repositories are distinct, nor is it attempting to do so. GitHub does not restrict authors from copying (downloading) a repository and uploading it as a new one. In the future, we intend to study the similarity and evolution among these repositories.

e. Are the authors of repositories the original creators of the source code? This is an interesting and complex question that goes beyond the scope of this work. Identifying the original creator will require studying the source code of all related repositories, and analyzing the dynamics of the hacker authors, which we intend to do in the future.

f. Are all the malware authors malicious? Not necessarily. This is an interesting question, but it is not central to the main point of our work. On the one hand, we find some white-hat hackers or researchers, such as Yuval Nativ [74], or Nicolas Verdier [47]. On the other hand, several authors seem to be malicious, as we saw in Section 7.

g. Are our malware repositories in "working order"? It is hard to know for sure, but we attempt to answer indirectly. First, we pick 30 malware source codes: all of them compiled, and a subset of 15 of them actually run successfully in an emulated environment, as we already mentioned. Second, these public repositories are a showcase for the skills of the author, who will be reluctant to have repositories of low quality. Third, public repositories, especially popular ones, are inevitably scrutinized by their followers.

h. Can we handle evasion efforts? Our goal is to create the largest malware source-code database possible, and having
collected 7504 malware repositories seems like a great start. In the future, malware authors could obfuscate their repositories by using misleading titles, descriptions, and even filenames. We argue that authors seem to want their repositories to be found, which is why they are public. We also have to be clear: it is easy for the authors to hide their repositories, and they could start by making them private or avoiding GitHub altogether. However, both these moves will diminish the visibility and "street-cred" of the authors.

i. Will our approach generalize to other archives? We believe that SourceFinder can generalize to other archives which provide public repositories, like GitLab and BitBucket. We find that these sites allow public repositories and let the users retrieve repositories. We have also seen equivalent data fields (title, description, etc.). Therefore, we are confident that our approach can work with other archives.

9 Related Work

There are several works that attempt to determine if a piece of software is malware, usually focusing on a binary, using static or dynamic analysis [4, 17, 36, 60]. However, to the best of our knowledge, no previous study has focused on identifying malware source code in public software archives, such as GitHub, in a systematic manner as we do in this work. We highlight the related works in the following categories:

a. Studies that need malware source code. Several studies [40, 62, 78] use malware source code that is manually retrieved from GitHub repositories. Some studies [8] [9] compare the evolution and code reuse of 150 malware source codes (with only some from GitHub) with those of benign software, from a software engineering perspective. Overall, various studies [22, 32] can benefit from malware source code to fine-tune their approach.

b. Mining and analyzing GitHub: Many studies have analyzed different aspects of GitHub, but not with the intention of retrieving malware repositories. First, there are efforts that study the user interactions and collaborations on GitHub and their relationship to other social media [30, 37, 50]. Second, some efforts discuss the challenges in extracting and analyzing data from GitHub with respect to sampling biases [14, 27]. Other works [34, 35] study how users utilize the various features and functions of GitHub. Several studies [31, 53, 67] discuss the challenges of mining software archives, like SourceForge and GitHub, arguing that more information is required to make assertions about users and software projects. Finally, some efforts [61, 63, 76, 77] study GitHub repositories, but they focus on establishing a systematic method for identifying similarities, and use it to identify classes of repositories (e.g., Android versus web applications). Most of these studies use topic modeling, which is one of the approaches that we considered initially, but it gave poor results in our context; we will revisit it in the future.

c. Databases of malware source code: At the time of writing this paper, there are few malware source code databases, and they are rarely updated, such as project theZoo [72]. To the best of our knowledge, there does not exist an active archive of malware source code where the malware research community can get enough source code to analyze.

d. Databases of malware binaries: There are well-established malware binary collection initiatives, such as Virustotal [68], which provides analysis results for a malware binary. There are also community-based projects such as VirusBay [69] that serve as malware binary sharing platforms.

e. Converting binaries to source code: A complementary approach is to try to generate the source code from the binary, but this is a very hard task. Some works [19, 20] focus on reverse engineering of the malware binary to a high-level language representation, but not source code. Some other efforts [11, 29, 59] introduce binary decompilation into readable source code. However, malware authors use sophisticated obfuscation techniques [56] [10, 73] to make it difficult to reverse engineer a binary into source code.

f. Measuring and modeling hacking activity. Some other studies analyze the underground black market of hacking activities, but their starting point is security forums [18, 51, 57], and as such they study the dynamics of that community without retrieving any malware code.

10 Conclusion

Our work capitalizes on a great missed opportunity: there are thousands of malware source code repositories on GitHub. At the same time, there is a scarcity of malware source code, which is necessary for certain research studies.

Our work is arguably the first to develop a systematic approach to extract malware source-code repositories at scale from GitHub. Our work provides two main tangible outcomes: (a) we develop SourceFinder, which identifies malware repositories with 89% precision, and (b) we create, possibly, the largest non-commercial malware source code archive, with 7504 repositories. Our large-scale study provides some interesting trends for both the malware repositories and the dynamics of the malware authors.

We intend to open-source both SourceFinder and the database of malware source code to maximize the impact of our work. Our ambitious vision is to become the authoritative source for malware source code for the research community by providing tools, databases, and benchmarks.

Acknowledgements

This work was supported by the UC Multicampus National Lab Collaborative Research and Training (UC NL CRT) award #LFR18548554.

References

[1] 3vilp4wn. Hacking tool of 3vilp4wn. https://github.com/3vilp4wn/CryptLog/. [Online; accessed 08-February-2020].
[2] Kevin Allix, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pages 468–471. IEEE, 2016.

[3] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. Drebin: Effective and explainable detection of android malware in your pocket. In NDSS, volume 14, pages 23–26, 2014.

[4] John Aycock. Computer viruses and malware, volume 22. Springer Science & Business Media, 2006.

[5] Andrew Begel, Jan Bosch, and Margaret-Anne Storey. Social networking meets software development: Perspectives from github, msdn, stack exchange, and topcoder. IEEE Software, 30(1):52–66, 2013.

[6] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.

[7] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[8] Alejandro Calleja, Juan Tapiador, and Juan Caballero. A look into 30 years of malware development from a software metrics perspective. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 325–345. Springer, 2016.

[9] Alejandro Calleja, Juan Tapiador, and Juan Caballero. The malsource dataset: Quantifying complexity and code reuse in malware development. IEEE Transactions on Information Forensics and Security, 14(12):3175–3190, 2018.

[10] Gengbiao Chen, Zhengwei Qi, Shiqiu Huang, Kangqi Ni, Yudi Zheng, Walter Binder, and Haibing Guan. A refined decompiler to generate c code with high readability. Software: Practice and Experience, 43(11):1337–1358, 2013.

[11] Gengbiao Chen, Zhuo Wang, Ruoyu Zhang, Kan Zhou, Shiqiu Huang, Kangqi Ni, and Zhengwei Qi. A novel lightweight virtual machine based decompiler to generate c/c++ code with high readability. School of Software, Shanghai Jiao Tong University, Shanghai, China, 11, 2010.

[12] Jingnian Chen, Houkuan Huang, Shengfeng Tian, and Youli Qu. Feature selection for text classification with naïve bayes. Expert Systems with Applications, 36(3):5432–5435, 2009.

[13] Chris Stobing. ios malwares in 2014. https://www.digitaltrends.com/computing/decrypt-2014-biggest-year-malware-yet/. [Online; accessed 08-February-2020].

[14] Valerio Cosentino, Javier Luis Cánovas Izquierdo, and Jordi Cabot. Findings from github: methods, datasets and limitations. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pages 137–141. IEEE, 2016.

[15] CR4SH. Hacking tool of cr4sh. https://github.com/Cr4sh/s6_pcie_microblaze/. [Online; accessed 08-February-2020].

[16] Laura Dabbish, Colleen Stuart, Jason Tsay, and Jim Herbsleb. Social coding in github: transparency and collaboration in an open software repository. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pages 1277–1286. ACM, 2012.

[17] Anusha Damodaran, Fabio Di Troia, Corrado Aaron Visaggio, Thomas H Austin, and Mark Stamp. A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques, 13(1):1–12, 2017.

[18] Ashok Deb, Kristina Lerman, and Emilio Ferrara. Predicting cyber-events by leveraging hacker sentiment. Information, 9(11):280, November 2018.

[19] Lukáš Ďurfina, Jakub Křoustek, and Petr Zemek. Psybot malware: A step-by-step decompilation case study. In 2013 20th Working Conference on Reverse Engineering (WCRE), pages 449–456. IEEE, 2013.

[20] Lukáš Ďurfina, Jakub Křoustek, Petr Zemek, Dušan Kolář, Tomáš Hruška, Karel Masařík, and Alexander Meduna. Design of a retargetable decompiler for a static platform-independent malware analysis. In International Conference on Information Security and Assurance, pages 72–86. Springer, 2011.

[21] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review, 29(4):251–262, 1999.

[22] Sri Shaila G, Ahmad Darki, Michalis Faloutsos, Nael Abu-Ghazaleh, and Manu Sridharan. Idapro for iot malware analysis? In 12th USENIX Workshop on Cyber Security Experimentation and Test (CSET 19), Santa Clara, CA, August 2019. USENIX Association.
[23] Joobin Gharibshah, Evangelos E Papalexakis, and Michalis Faloutsos. Rest: A thread embedding approach for identifying and classifying user-specified information in security forums. arXiv preprint arXiv:2001.02660, 2020.

[24] GitHub. Repository search for public repositories: Showing 32,107,794 available repository results. https://github.com/search?q=is:public/. [Online; accessed 13-October-2019].

[25] GitHub. User search: Showing 34,149,146 available users. https://github.com/search?q=type:user&type=Users/. [Online; accessed 13-October-2019].

[26] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8(Oct):2265–2295, 2007.

[27] Georgios Gousios and Diomidis Spinellis. Mining software engineering data from github. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pages 501–502. IEEE, 2017.

[28] HackerChatter. UCR hacker forum webtool for extracting useful information from security forums! http://www.hackerchatter.org/. [Online; accessed 22-July-2020].

[29] Richard Healey. Source code extraction via monitoring processing of obfuscated byte code, August 27 2019. US Patent 10,394,554.

[30] Sameera Horawalavithana, Abhishek Bhattacharjee, Renhao Liu, Nazim Choudhury, Lawrence O Hall, and Adriana Iamnitchi. Mentions of security vulnerabilities on reddit, twitter and github. In IEEE/WIC/ACM International Conference on Web Intelligence, pages 200–207. ACM, 2019.

[31] James Howison and Kevin Crowston. The perils and pitfalls of mining sourceforge. In MSR, pages 7–11. IET, 2004.

[32] James A Jerkins. Motivating a market or regulatory solution to iot insecurity with the mirai botnet code. In 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), pages 1–5. IEEE, 2017.

[33] Anjali Ganesh Jivani et al. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl, 2(6):1930–1938, 2011.

[34] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. The promises and perils of mining github. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 92–101. ACM, 2014.

[35] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. An in-depth study of the promises and perils of mining github. Empirical Software Engineering, 21(5):2035–2071, 2016.

[36] Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin Kirda, Xiao-yong Zhou, and XiaoFeng Wang. Effective and efficient malware detection at the end host. In USENIX Security Symposium, volume 4, pages 351–366, 2009.

[37] Bence Kollanyi. Automation, algorithms, and politics: Where do bots come from? An analysis of bot codes shared on github. International Journal of Communication, 10:20, 2016.

[38] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[39] Michael J Lee, Bruce Ferwerda, Junghong Choi, Jungpil Hahn, Jae Yun Moon, and Jinwoo Kim. Github developers use rockstars to overcome overflow of news. In CHI'13 Extended Abstracts on Human Factors in Computing Systems, pages 133–138. ACM, 2013.

[40] Toomas Lepik, Kaie Maennel, Margus Ernits, and Olaf Maennel. Art and automation of teaching malware reverse engineering. In International Conference on Learning and Collaboration Technologies, pages 461–472. Springer, 2018.

[41] Yitan Li, Linli Xu, Fei Tian, Liang Jiang, Xiaowei Zhong, and Enhong Chen. Word embedding revisited: A new representation learning and explicit matrix factorization perspective. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[42] Michael Frederick McTear, Zoraida Callejas, and David Griol. The conversational interface, volume 6. Springer, 2016.

[43] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[44] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[45] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.

[46] n1nj4sec. Pupy tool. https://github.com/n1nj4sec/pupy/wiki/. [Online; accessed 08-February-2020].

[47] Nicolas Verdier. Security researcher. https://www.linkedin.com/in/nicolas-verdier-b23950b6/. [Online; accessed 14-February-2020].

[48] Nikhil Gupta. Should we create a separate git repository of each project or should we keep multiple projects in a single git repo? https://www.quora.com/. [Online; accessed 14-February-2020].

[49] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[50] Daniel Pletea, Bogdan Vasilescu, and Alexander Serebrenik. Security and emotion: sentiment analysis of security discussions on github. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 348–351. ACM, 2014.

[51] Rebecca S. Portnoff, Sadia Afroz, Greg Durrett, Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, Damon McCoy, Kirill Levchenko, and Vern Paxson. Tools for automated analysis of cybercriminal markets. In Proceedings of the 26th International Conference on World Wide Web (WWW '17), pages 657–666, 2017.

[52] PyGithub. A python library to use github api v3. https://github.com/PyGithub/PyGithub/. [Online; accessed 13-October-2019].

[53] Austen Rainer and Stephen Gale. Evaluating the quality and quantity of data on open source software projects. In Procs 1st Int Conf on Open Source Software, 2005.

[54] Raj Chandel. Article on pupy. https://www.hackingarticles.in/command-control-tool-pupy/. [Online; accessed 08-February-2020].

[55] Monica Rogati and Yiming Yang. High-performing feature selection for text classification. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 659–661, 2002.

[56] Hassen Saïdi, Phillip Porras, and Vinod Yegneswaran. Experiences in malware binary deobfuscation. Virus Bulletin, 2010.

[57] Anna Sapienza, Sindhu Kiranmai Ernala, Alessandro Bessi, Kristina Lerman, and Emilio Ferrara. Discover: Mining online chatter for emerging cyber threats. In Companion Proceedings of The Web Conference 2018, WWW '18, pages 983–990, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee.

[58] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307, 2015.

[59] Eric Schulte, Jason Ruchti, Matt Noonan, David Ciarletta, and Alexey Loginov. Evolving exact decompilation. In Workshop on Binary Analysis Research (BAR), 2018.

[60] Madhu K Shankarapani, Subbu Ramamoorthy, Ram S Movva, and Srinivas Mukkamala. Malware detection using assembly and api call sequences. Journal in Computer Virology, 7(2):107–119, 2011.

[61] Abhishek Sharma, Ferdian Thung, Pavneet Singh Kochhar, Agus Sulistya, and David Lo. Cataloging github repositories. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, pages 314–319, 2017.

[62] Victor RL Shen, Chin-Shan Wei, and Tony Tong-Ying Juang. Javascript malware detection using a high-level fuzzy petri net. In 2018 International Conference on Machine Learning and Cybernetics (ICMLC), volume 2, pages 511–514. IEEE, 2018.

[63] Marcus Soll and Malte Vosgerau. Classifyhub: an algorithm to classify github repositories. In Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz), pages 373–379. Springer, 2017.

[64] SL Ting, WH Ip, and Albert HC Tsang. Is naive bayes a good classifier for document classification. International Journal of Software Engineering and Its Applications, 5(3):37–46, 2011.

[65] Tom K. Hacking news of fahim magsi. https://www.namepros.com/threads/hacked-by-muslim-hackers.950924/. [Online; accessed 08-February-2020].

[66] Tommy Hodgins. Choosing between "one project per repository" vs "multiple projects per repository" architecture. https://hashnode.com/. [Online; accessed 14-February-2020].

[67] Christoph Treude, Larissa Leite, and Maurício Aniche. Unusual events in github repositories. Journal of Systems and Software, 142:237–247, 2018.
162 23rd International Symposium on Research in Attacks, Intrusions and Defenses USENIX Association
[68] VirusTotal. Free online virus, malware and URL scanner. https://www.virustotal.com/en. [Online; accessed 08-February-2020].

[69] VirusBay. A web-based collaboration platform for malware researchers. https://beta.virusbay.io/. [Online; accessed 08-February-2020].

[70] Wikipedia. Linux based botnet BASHLITE. https://en.wikipedia.org/wiki/BASHLITE/. [Online; accessed 08-February-2020].

[71] Shuo Xu. Bayesian naïve Bayes classifiers to text classification. Journal of Information Science, 44(1):48–59, 2018.

[72] Y. Nativ and S. Shalev. theZoo: A live malware repository. https://github.com/ytisf/theZoo. [Online; accessed 08-February-2020].

[73] Khaled Yakdan, Sergej Dechand, Elmar Gerhards-Padilla, and Matthew Smith. Helping Johnny to analyze malware: A usability-optimized decompiler and malware analysis user study. In 2016 IEEE Symposium on Security and Privacy (SP), pages 158–177. IEEE, 2016.

[74] Yuval Nativ. Security researcher. https://morirt.com/. [Online; accessed 14-February-2020].

[75] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52, 2010.

[76] Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, and Jiawei Han. HiGitClass: Keyword-driven hierarchical classification of GitHub repositories. In 2019 IEEE International Conference on Data Mining (ICDM), pages 876–885. IEEE, 2019.

[77] Yun Zhang, David Lo, Pavneet Singh Kochhar, Xin Xia, Quanlai Li, and Jianling Sun. Detecting similar repositories on GitHub. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 13–23. IEEE, 2017.

[78] Xingsi Zhong, Yu Fu, Lu Yu, Richard Brooks, and G. Kumar Venayagamoorthy. Stealthy malware traffic - not as innocent as it looks. In 2015 10th International Conference on Malicious and Unwanted Software, pages 110–116. IEEE, 2015.