Synthetic Minority Over-Sampling Technique (SMOTE) for Predicting Software Build Outcomes
Abstract— In this research we use a data stream approach to mine data and construct Decision Tree models that predict software build outcomes in terms of software metrics derived from the source code used in the software construction process. The rationale for using the data stream approach was to track the evolution of the prediction model over time as builds are incrementally constructed from previous versions, either to remedy errors or to enhance functionality. As the volume of data available for mining from the software repository that we used was limited, we synthesized new data instances through the application of the SMOTE oversampling algorithm. The results indicate that a small number of the available metrics have significance for predicting software build outcomes. It is observed that classification accuracy steadily improves after approximately 900 instances of builds have been fed to the classifier. At the end of the data streaming process classification accuracies of 80% were achieved, though some bias arises due to the distribution of data across the two classes over time.

Keywords- SMOTE, Data Stream Mining, Jazz, Software Metrics, Software Repositories.
I. INTRODUCTION

The Mining Software Repositories (MSR) field analyses the rich data available in software repositories to uncover interesting and actionable information about software systems and projects. This enables researchers to reveal interesting patterns and information about the development of software systems. MSR has been a very active research area since 2004 [4]. Until the emergence of MSR as a research endeavor, the data from software repositories were mostly used as historical records for supporting development activities. Analysis of MSR research has shown that the approach of extracting knowledge from a repository has the potential to be a valuable method for analyzing the software development process in many domains [5]. However, there are a number of data-related challenges, one of which is how to deal with repositories that contain insufficient or imbalanced data, either because the project is still immature or because the development environment is such that data is discarded over time.

Our previous work [21] developed an approach for applying data stream mining techniques to overcome the challenge of discarded data, utilizing the Jazz repository [1]. The Jazz repository stores data, including source code, related to each software build attempt and retains the build outcome, categorized as success or failure. As the volume of data associated with each build is large, only a limited number of build instances are actually stored in the repository, on a first-in, first-out basis. Traditional data mining methods are tailored to static data environments where the data is retained. The first major challenge in mining software repositories is therefore dealing with dynamic data that arrives on a continuous basis. Our previous work [21] addressed this challenge by modeling the development process as a data stream, dealing with software project data that is produced continuously and accumulated over a period of time before being discarded. The data stream mining approach was shown to be effective at maintaining knowledge related to the development project even after the data is discarded.

This paper attempts to extend our work to address the second challenge, namely to deal with the limited volume of data and improve the accuracy of prediction. In order to boost the training power of the limited quantity of available data, the SMOTE [2] oversampling algorithm was applied to synthesize new data instances from the available instances prior to inducing a decision tree model implemented via the Hoeffding tree method [3]. For the simulation to be realistic, the naturally occurring distribution of successful and failed build instances in the original population was maintained in the oversampling process.

The dynamic nature of software and the resulting changes in software development strategies over time cause changes in the patterns that govern software project outcomes. This phenomenon has been recognized in many other domains and is referred to as concept drift. Changes in a data stream can evolve slowly or quickly, and rates of change can be queried within stream-based tools. This paper describes an attempt to improve build outcome prediction accuracies for the Jazz project by synthetically creating data to boost the training power of the data stream mining approach, while taking into account concept drift that occurs as part of the stream.

II. BACKGROUND AND RELATED WORK

This research draws from multiple areas to inform the direction of inquiry. In particular, it is placed in the context of other research related to Mining Software Repositories, specifically in the context of the Jazz repository. In addition, it uses experience gained applying data stream mining and synthetic data generation in other domains to improve the prediction models developed for the Jazz project.
A. Mining the Jazz Repository

The Jazz development environment has been recognized as offering new opportunities in terms of MSR research because it integrates the software source code archive and bug database by linking bug reports and source code changes with each other [6]. Whilst this provides much potential for gaining valuable insights into the development process of software projects, such potential is yet to be fully realized. To date, much of the work focused on the Jazz repository is related to predicting build success, either through social network analysis [7] or source code metrics [21, 22]. As is common with much MSR research, the goal of working with the Jazz repository is in line with a key direction identified in the field [23], which is the transformation of software repositories from static record-keeping ones into active repositories that guide decision processes in modern software projects.

B. Data Stream Mining

The mining of data streams has arisen as a necessity due to advances in hardware and software that have enabled the capture of different measurements of data in a wide range of fields [24]. Data streams are typically generated continuously and have very high, fluctuating data rates. The storage, querying and mining of such data sets are computationally challenging tasks [24]. Research problems and challenges that have arisen in mining data streams can be solved using well-established statistical and computational approaches that can be categorized as either data-based or task-based. In data-based solutions, only a subset of the whole dataset is examined or the data is transformed to an approximate, smaller representation. Task-based solutions involve applying techniques from computational theory to achieve time- and space-efficient solutions. Data-based solutions include Sampling, Load Shedding, Sketching and Aggregation. Task-based solutions include Approximation Algorithms and Sliding Window approaches, all of which have received considerable attention by researchers [28]. The discarding of data from the Jazz environment and the relatively low data rate lend themselves to a Sliding Window solution.
In addition, various data mining approaches can be applied to mining data streams, including clustering, frequency counting and classification. The nature of the data, which includes a classifiable attribute in terms of build outcome, lends itself to a classification method. In this work, we have applied the Hoeffding tree incremental learner in conjunction with the Adaptive Sliding Window (ADWIN) concept drift detector. ADWIN is a parameter-free adaptive sliding window drift detector that compares all adjacent sub-windows in a given data window in order to detect a concept drift point [29]. This method is recognized to produce high true positive and low false positive rates while having low detection delay times in comparison to other drift detectors proposed in the data mining literature [29].
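To make the detector concrete, the sketch below illustrates the core ADWIN idea in Python: maintain a window of recent observations (for example, per-instance 0/1 prediction correctness), compare the means of every pair of adjacent sub-windows, and drop the older portion whenever the difference exceeds a confidence bound. This is a simplified, illustrative version of the mechanism; the actual ADWIN of [29] compresses the window into exponential buckets, and the class name and cut threshold used here are our own.

import math
from collections import deque

class SimpleAdwin:
    """Simplified adaptive-window drift detector (illustrative only)."""

    def __init__(self, delta=0.002):
        self.delta = delta      # detection confidence is (1 - delta)
        self.window = deque()   # recent observations, oldest on the left

    def _cut_threshold(self, n0, n1):
        # Hoeffding-style bound on the allowed difference of two means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def _find_cut(self):
        # Return a split point whose adjacent sub-windows differ, if any.
        n, total, head = len(self.window), sum(self.window), 0.0
        for i, v in enumerate(self.window, start=1):
            head += v
            if i == n:
                break
            mean_old, mean_new = head / i, (total - head) / (n - i)
            if abs(mean_old - mean_new) > self._cut_threshold(i, n - i):
                return i
        return None

    def update(self, value):
        """Add one observation (e.g. 1 if the build prediction was correct,
        0 otherwise); return True if a concept drift point was detected."""
        self.window.append(value)
        drift = False
        while len(self.window) > 1:
            cut = self._find_cut()
            if cut is None:
                break
            for _ in range(cut):
                self.window.popleft()   # discard the stale sub-window
            drift = True
        return drift

Feeding per-instance correctness values from the classifier into update() yields drift points that can be used to trigger adaptation of the model.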
C. Synthetic Data Generation

Many of the challenges associated with data stream mining are related to dealing with high volumes of data in relatively short timescales. It therefore seems counter-intuitive to deploy synthetic data generation techniques in conjunction with a data stream mining approach. However, the discarding of data from Jazz does not encourage the use of static classification approaches in practice, even though such approaches can be deployed on any given snapshot of the repository [22]. Deploying a data stream method in conjunction with synthetic data generation allows a consistent approach to be used in practice for new projects. Synthetic data can be generated from the limited quantity of actual data that is available in the early stages of development, and such synthetic data can be gradually phased out as larger volumes of real data become available. Such consistency is important if data mining approaches are to become useful to software practitioners.

Synthetic data generation has been a research area for some time, with the literature containing many examples of random or pseudo-random data generation [25]. However, the goal of our research is such that synthetic data must be representative of the real data, and therefore a more refined generation approach is required. Such approaches include DataBoost-IM [26], ADASYN [27] and SMOTE [2], to name but a few. Many of these approaches are based on similar sampling algorithms, and in this work we have elected to apply the standard SMOTE algorithm as it has been effectively applied in many domains.
III. THE JAZZ DATASET

IBM Jazz is a fully integrated software development tool that automatically captures software development processes and artifacts. The Jazz repository contains real-time evidence that allows researchers to gain insights into team collaboration and development activities within software engineering projects [1, 7]. The Jazz repository artifacts include work items, build items, change sets, source code files, authors and comments. A work item is a description of a unit of work, which is categorized as a task, enhancement or defect. A build item is software compiled to form a working unit. A change set is a collection of code changes in a number of files. In Jazz a change set is created by one author only and relates to one work item. A single work item may contain many change sets. Source code files are included in change sets and over time can be related to multiple change sets.
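The relationships among these artifacts can be summarized in a small, purely illustrative data model; the class and field names below are ours (not Jazz API types) and simply encode the cardinalities just described.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceCodeFile:
    path: str                       # a file may appear in many change sets

@dataclass
class ChangeSet:
    author: str                     # exactly one author per change set
    files: List[SourceCodeFile]     # code changes across a number of files

@dataclass
class WorkItem:
    category: str                   # "task", "enhancement" or "defect"
    description: str
    change_sets: List[ChangeSet] = field(default_factory=list)  # one-to-many

@dataclass
class BuildItem:
    outcome: str                    # "success" or "failure"
    change_sets: List[ChangeSet] = field(default_factory=list)  # changes compiled into the build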
One of the challenges associated with working with the Jazz repository is that the data contains holes and misleading elements which cannot be removed or identified easily. This is because the Jazz environment has been used within its own development; therefore many features provided by Jazz were not implemented at early stages of the project. This sparseness of the data has driven the decision to focus on using software metrics as the predictor attributes. Whilst features of the Jazz environment may not have been present during early phases of development, there has always been source code, and therefore a consistent set of data can be created.

IV. THEORETICAL FOUNDATIONS

Software metrics have been generated in order to deal with the sparseness of the data. Metric values can be derived by extracting development code from software repositories. Such metrics are commonly used within model-based project management methods.
Software metrics are used to measure the complexity, quality and effort of a software development project [8-12]. In the Jazz repository each software build contains change sets that indicate the actual source code files that were modified during the implementation of the build. Source code metrics for each file are computed using the IBM Software Analyzer tool. The build's after state was utilized in order to ensure that the source code snapshot represented the actual software artifact that either failed or succeeded.

The Jazz repository consists of various types of software builds. Included in this study were continuous builds (regular user builds), nightly builds (incorporating changes from the local site) and integration builds (integrating components from remote sites). As a result, the following basic, average basic, dependency, complexity, cohesion and Halstead software metrics were derived from the source code files for each build:

Basic Software Metrics:
Number of Types Per Package, Number of Comments, Lines of Code, Comment/Code Ratio, Number of Import Statements, Number of Interfaces, Number of Methods, Number of Parameters, Number of Lines, Average Number of Attributes Per Class, Average Number of Constructors Per Class, Average Number of Comments, Average Lines of Code Per Method, Average Number of Methods, Average Number of Parameters.

Dependency Metrics:
Abstractness, Afferent Coupling, Efferent Coupling, Maintainability Index, Instability, Normalized Distance.

Complexity Metrics:
Average Block Depth, Average Cyclomatic Complexity.

Cohesion Metrics:
Lack of Cohesion 1 (LCOM1), Lack of Cohesion 2 (LCOM2), Lack of Cohesion 3 (LCOM3).

Halstead Metrics:
Number of Operands, Number of Operators, Number of Unique Operands, Number of Unique Operators, Program Volume, Difficulty Level, Effort to Implement, Number of Delivered Bugs, Time to Implement, Program Length, Program Level, Program Vocabulary Size.
A. Synthetic Minority Over-sampling TEchnique (SMOTE)

When working with real-world data it is often found that data sets are heavily comprised of "normal" instances, with only a small percentage representing interesting findings. As a result, the "abnormal" instances have a negative impact on a model's performance, as they have a greater probability of misclassification using data mining methods [2, 13]. Data instances that introduce noise within the data are often found within the minority class [14, 15]. In order to overcome this limitation, synthetically under-sampling the majority class may improve a classifier's performance. However, in doing so valuable data may be lost and model over-fitting may occur, resulting in majority instances being wrongly classified as minority instances when new, unseen data is presented to the classifier model that was induced [14]. Another solution is to provide the classifier with more complete regions within the feature space via the creation of new instances that are synthesized from existing data instances.

SMOTE enables a data miner to over-sample the minority class to achieve potentially better classifier performance without loss of data [2, 13]. While other over-sampling methods exist, such as Ripper's Loss Ratio and Naive Bayes methods, SMOTE provides better levels of performance as it generates more minority class samples for a classifier to learn from, thereby allowing broader decision regions and coverage [13]. SMOTE has been utilized within the software research community and compared with other sampling techniques in software quality modeling (random under-sampling, random oversampling, cluster-based oversampling and Borderline-SMOTE), yielding encouraging results [5, 8]. SMOTE has also been applied as a sampling strategy for software defect prediction using data sets from NASA software projects [10, 16-18], and for fault-prone module detection using the MIS telecommunication systems data [24]. For this work SMOTE is applied as a supervised instance filter using the Weka [19] machine learning workbench.

In order to avoid the over-fitting problem while expanding minority class regions, SMOTE generates new instances by operating within the existing feature space. New instance values are derived from interpolation rather than extrapolation, so they still carry relevance to the underlying data set. For each minority class instance, SMOTE interpolates values using a k-nearest neighbor technique and creates attribute values for new data instances [8]. For each minority instance I, a new synthetic data instance is generated by taking the difference between the feature vector of I and that of a nearest neighbor J belonging to the same class, multiplying it by a random number between 0 and 1, and then adding it to I. This creates a random point along the line segment between instances I and J, resulting in the creation of a new instance within the data set [13]. This process is repeated for the other k-1 neighbors of the minority instance I. As a result SMOTE generates more general regions from the minority class, and decision tree classifiers are able to use the data set for better generalizations.
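As a concrete illustration of this interpolation step, the sketch below re-implements it with NumPy. It is an expository stand-in, not the Weka filter used in this work; the function name, the fixed k and the uniform sampling of neighbors are our own choices.

import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic rows from an (m, d) minority-class array.

    For each synthetic instance: pick a minority instance I, pick one of
    its k nearest same-class neighbors J, and interpolate
    I + u * (J - I) with u drawn uniformly from [0, 1).
    """
    if rng is None:
        rng = np.random.default_rng()
    m = len(minority)
    synthetic = np.empty((n_new, minority.shape[1]))
    for s in range(n_new):
        i = rng.integers(m)
        # Distances from I to every other minority instance.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        dists[i] = np.inf                      # exclude I itself
        neighbors = np.argsort(dists)[:k]      # k nearest same-class rows
        j = rng.choice(neighbors)
        u = rng.random()                       # interpolation factor in [0, 1)
        synthetic[s] = minority[i] + u * (minority[j] - minority[i])
    return synthetic

Because every synthetic point lies on a segment between two existing minority instances, the generated data stays inside the region already occupied by the minority class, which is the interpolation-not-extrapolation property described above.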
B. Hoeffding Tree

The Hoeffding tree is an incremental decision tree induction method. Using the Hoeffding bound, it ascertains the number of instances that are needed to split a given (decision) node of a tree and operates within a certain precision that can be predetermined [3]. This method has potential in terms of predicting future outcomes of software builds with high accuracy while working with real-world data. Rather than using training and test sets, instances are presented as streams. The Hoeffding tree is commonly used for classifying high-speed data streams. The algorithm that it uses generates a decision tree from data incrementally by inspecting each instance within a stream, without the need to store instances for later retrieval. The tree resides in memory during each iteration and stores information in its branches and leaves, potentially growing from "learning" every new instance. The decision tree itself can be inspected at any time during the streaming process. The quality of the tree is comparable to that produced by traditional mining techniques, even though instances are introduced in an incremental manner.

Just as with traditional decision tree learners, the Hoeffding tree is easy to interpret, making it easier to understand how the model works. In addition to this, decision tree learners have proven to provide accurate solutions to a wide range of problems that are based on multi-dimensional data. For Hoeffding trees each node of a decision tree undergoes a test which may result in it being split into two or more child nodes, with each instance sent down the relevant branch to its destination child node depending on the values of its attributes. The split test is implemented through the use of the Hoeffding bound, which is expressed as:

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}} \qquad (1)$$
The Hoeffding bound expressed in (1) above states that, with confidence $1-\delta$, the population mean of a random variable with range $R$ lies in the interval $[\bar{x}-\epsilon, \bar{x}+\epsilon]$, where $\bar{x}$ is the sample (observable) mean of the random variable. In the context of decision tree induction the random variable is information gain. The information gain function ranges in value from $0$ to $\log_2 c$, where $c$ is the number of classes; since $c=2$ in the mining problem that we undertake (only the outcomes success and failure are possible), the range $R$ reduces to 1. The variable $n$ refers to the number of data instances seen up to the point that the test is carried out. The bound holds true irrespective of the underlying data distribution generating the values, and depends only on the range of values, the number of observations made and a split confidence level. The Hoeffding tree uses the Hoeffding bound to determine whether an existing (leaf) node should be split as follows. Suppose that after $n$ data instances have arrived, the difference in information gain between the two highest ranking attributes $X_a$ and $X_b$ is $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b)$ (i.e. $X_a$ is the attribute with the highest information gain); then, with confidence $1-\delta$, the Hoeffding bound guarantees that the correct choice is to split the given leaf node on attribute $X_a$ if $\Delta\bar{G} > \epsilon$. A tie threshold parameter $\tau$ handles the case where the two best attributes remain nearly indistinguishable: once $\epsilon$ falls below $\tau$, a split is made on the current best attribute.
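As a worked example under assumed values (the specific numbers are ours, chosen only for illustration): with range R = 1 for the two-class problem, a split confidence of δ = 10⁻⁷ (a common default in Hoeffding tree implementations) and n = 200 instances observed at a leaf, equation (1) gives ε = √(ln 10⁷ / 400) ≈ 0.20, so the leaf splits on X_a only if its information gain leads the runner-up by more than 0.20. The snippet below computes ε and applies the split rule, including the tie threshold.

import math

def hoeffding_bound(r, delta, n):
    """Equation (1): epsilon for range r, confidence 1 - delta, n samples."""
    return math.sqrt((r * r * math.log(1.0 / delta)) / (2.0 * n))

def should_split(gain_best, gain_second, n, r=1.0, delta=1e-7, tau=0.05):
    """Hoeffding tree split test: split when the observed gain difference
    exceeds epsilon, or when epsilon shrinks below the tie threshold tau."""
    eps = hoeffding_bound(r, delta, n)
    return (gain_best - gain_second) > eps or eps < tau

# With r=1, delta=1e-7 and n=200: eps ~= 0.2007, so the best attribute is
# chosen only if it leads by more than ~0.20 bits of information gain.
print(hoeffding_bound(1.0, 1e-7, 200))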
In this research we use the Hoeffding tree implementation from MOA [13], a real-time analytics tool for mining data streams.
V. EXPERIMENTAL STUDY

The original software metric data set consists of 199 Jazz build instances: 127 successful builds and 72 failed builds. Build instances are sorted by date to ensure accurate simulation of a development team working over time. SMOTE is then applied twice at 900%, increasing the number of instances to 1,990 (1,270 successful builds and 720 failed builds). The first application increases the number of minority class instances (failed builds), and the second application increases the class that has temporarily become the new minority (successful builds). The instances are then encoded into data streams which are utilized by the Hoeffding tree for the data mining process. Three parameters were set for the tree induction. The Hoeffding tree uses a grace period parameter which stipulates the frequency with which checks for leaf node splits are carried out; the greater the value, the higher the efficiency of the process. We use a setting of 200 for the grace period parameter. The tie threshold parameter, which controls the degree of splitting, was set to 0.05.
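A minimal sketch of this pipeline is given below, assuming the smote() helper from Section IV.A and an incremental classifier object exposing predict/learn methods (a stand-in for MOA's Hoeffding tree configured with a grace period of 200 and a tie threshold of 0.05). It is an illustrative reconstruction, not the actual Weka/MOA tooling used in the study.

import numpy as np

def oversample_both_classes(X, y, rng):
    """Apply SMOTE twice at 900%, as in the experiment: first to the
    failed builds (the original minority), then to the successful builds
    (the temporary minority after the first pass)."""
    out_X, out_y = [X], [y]
    for label in (0, 1):                    # assumed coding: 0 = failed, 1 = successful
        cls = X[y == label]
        new = smote(cls, n_new=9 * len(cls), rng=rng)   # +900% per class
        out_X.append(new)
        out_y.append(np.full(len(new), label))
    return np.vstack(out_X), np.concatenate(out_y)

def prequential(model, X, y):
    """Test-then-train over the date-ordered stream; returns running accuracy."""
    correct, acc = 0, []
    for i, (xi, yi) in enumerate(zip(X, y), start=1):
        correct += int(model.predict(xi) == yi)   # test on the instance first...
        model.learn(xi, yi)                       # ...then train on it
        acc.append(correct / i)
    return acc

Because SMOTE only interpolates between same-class instances, applying it to each class in turn is equivalent to the two sequential passes described above, and the prequential loop reproduces the accuracy-over-instances curves reported in the figures.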
Presented in Figure 1 is the classification accuracy obtained with the use of after state metrics for builds. The classification accuracy at the start of the time series was 65.2%, and at the end of the stream the accuracy grew to 80.25%. The average overall accuracy over the entire time series was 70%. This indicates that there is potential for the accuracy of prediction to improve as more real data emerges.

[Figure 1. Hoeffding Tree Overall Classification Accuracy: overall accuracy (%) versus the number of trained instances.]

The initial instability in classification accuracy is an interesting phenomenon, given that the initial grace period of 200 builds is intended to provide stability in the emerging model. Upon examination of the synthetic data it can be observed that the data maintains comparable instances of each class up until 900 builds. After 900 builds, the data contains an increasing proportion of successful builds. This is shown in Figure 2.

[Figure 2. Build Distribution over Time: cumulative number of failed and successful builds versus the number of trained instances.]

Figure 3 presents the classification accuracies of successful builds. It is observed that the general trend for classifying success initially declines to reach a minimum at approximately 900 instances, after which there is a gradual improvement that appears to be trending towards a stable value of around 80%. Figure 4 displays the sensitivity ratings for successful builds over time. For successful builds the accuracy at the beginning of the data stream time series was 66.38% and ended with 79.1% (with an average of 64%).

[Figure 3. Hoeffding Tree Classification Accuracy for Successful Builds: accuracy (%) versus the number of trained instances.]
[Figure 4. Sensitivity ratings (truePositive and falsePositive rates) for successful builds versus the number of trained instances.]

Other work [30] suggests that failed builds are harder to classify than successful builds. This work suggests that failed builds may be harder to classify when there is a significantly larger number of successful builds that dominate the classification model.

Figure 7 illustrates the final decision tree produced by the Hoeffding tree stream mining technique on the extended RSA after state software metrics data set. In this case the tree is larger than the previous software metric based Hoeffding tree, with a depth of 7. Upon inspecting the tree, common sense classifications are being made; for example, a higher number of interfaces tends to be associated with failure. This is intuitive

[Figure 7. Final Hoeffding Tree for After State Software Metrics.]