Synthetic Minority Over-Sampling Technique (SMOTE) for Predicting Software Build Outcomes
Abstract— In this research we use a data stream approach to mine data and construct Decision Tree models that predict software build outcomes in terms of software metrics derived from the source code used in the software construction process. The rationale for using the data stream approach was to track the evolution of the prediction model over time as builds are incrementally constructed from previous versions, either to remedy errors or to enhance functionality. As the volume of data available for mining from the software repository that we used was limited, we synthesized new data instances through the application of the SMOTE oversampling algorithm. The results indicate that a small number of the available metrics have significance for predicting software build outcomes. It is observed that classification accuracy steadily improves after approximately 900 instances of builds have been fed to the classifier. At the end of the data streaming process classification accuracies of 80% were achieved, though some bias arises due to the distribution of data across the two classes over time.

Keywords- SMOTE, Data Stream Mining, Jazz, Software Metrics, Software Repositories.
I. INTRODUCTION

The Mining Software Repositories (MSR) field analyses the rich data available in software repositories to uncover interesting and actionable information about software systems and projects. This enables researchers to reveal interesting patterns and information about the development of software systems. MSR has been a very active research area since 2004 [4]. Until the emergence of MSR as a research endeavor, the data from software repositories were mostly used as historical records for supporting development activities. Analysis of MSR research has shown that the approach of extracting knowledge from a repository has the potential to be a valuable method for analyzing the software development process in many domains [5]. However, there are a number of data-related challenges, one of which is how to deal with repositories that contain insufficient or imbalanced data, either because the project is still immature or because the development environment is such that data is discarded over time.

Our previous work [21] developed an approach for applying data stream mining techniques to overcome the challenge of discarded data, utilizing the Jazz repository [1]. The Jazz repository stores data, including source code, related to each software build attempt and retains the build outcome, categorized as success or failure. As the volume of data associated with each build is large, only a limited number of build instances are actually stored in the repository, on a first-in, first-out basis. Traditional data mining methods are tailored to static data environments where the data is retained. The first major challenge in mining software repositories is therefore dealing with dynamic data that arrives on a continuous basis. Our previous work [21] addressed this challenge by modeling the development process as a data stream, dealing with software project data that is produced continuously and accumulated over a period of time before being discarded. The data stream mining approach was shown to be effective at maintaining knowledge related to the development project even after the data is discarded.

This paper attempts to extend our work to address the second challenge, namely to deal with the limited volume of data and improve the accuracy of prediction. In order to boost the training power of the limited quantity of available data, the SMOTE [2] oversampling algorithm was applied to synthesize new data instances from the available instances prior to inducing a decision tree model implemented via the Hoeffding tree method [3]. For the simulation to be realistic, the naturally occurring distribution of successful and failed build instances in the original population was maintained in the oversampling process.

The dynamic nature of software and the resulting changes in software development strategies over time cause changes in the patterns that govern software project outcomes. This phenomenon has been recognized in many other domains and is referred to as concept drift. Changes in a data stream can evolve slowly or quickly, and rates of change can be queried within stream-based tools. This paper describes an attempt to improve build outcome prediction accuracies for the Jazz project by synthetically creating data to boost the training power of the data stream mining approach, while taking into account concept drift that occurs as part of the stream.

II. BACKGROUND AND RELATED WORK

This research draws from multiple areas to inform the direction of inquiry. In particular, it is placed in the context of other research related to Mining Software Repositories, specifically in the context of the Jazz repository. In addition, it uses experience gained applying data stream mining and synthetic data generation in other domains to improve the prediction models developed for the Jazz project.
A. Mining the Jazz Repository

The Jazz development environment has been recognized as offering new opportunities in terms of MSR research because it integrates the software source code archive and bug database by linking bug reports and source code changes with each other [6]. Whilst this provides much potential for gaining valuable insights into the development process of software projects, such potential is yet to be fully realized. To date, much of the work focused on the Jazz repository is related to predicting build success, either through social network analysis [7] or source code metrics [21, 22]. As is common with much MSR research, the goal of working with the Jazz repository is in line with a key direction identified in the field [23], which is the transformation of software repositories from static record-keeping ones into active repositories that guide decision processes in modern software projects.

B. Data Stream Mining

The mining of data streams has arisen as a necessity due to advances in hardware and software that have enabled the capture of different measurements of data in a wide range of fields [24]. Data streams are typically generated continuously and have very high, fluctuating data rates. The storage, querying and mining of such data sets are computationally challenging tasks [24]. Research problems and challenges that have arisen in mining data streams can be solved using well-established statistical and computational approaches that can be categorized as either data-based or task-based. In data-based solutions, only a subset of the whole dataset is examined or the data is transformed to an approximate, smaller representation. Task-based solutions involve applying techniques from computational theory to achieve time- and space-efficient solutions. Data-based solutions include Sampling, Load Shedding, Sketching and Aggregation. Task-based solutions include Approximation Algorithms and Sliding Window approaches, all of which have received considerable attention by researchers [28]. The discarding of data from the Jazz environment and the relatively low data rate lend themselves to a Sliding Window solution.
In addition, various data mining approaches can be applied to mining data streams, including clustering, frequency counting and classification. The nature of the data, which includes a classifiable attribute in terms of build outcome, lends itself to a classification method. In this work, we have applied the Hoeffding tree incremental learner in conjunction with the Adaptive Sliding Window (ADWIN) concept drift detector. ADWIN is a parameter-free adaptive sliding window drift detector that compares all adjacent sub-windows in a given data window in order to detect a concept drift point [29]. This method is recognized to produce high true positive and low false positive rates while having low detection delay times in comparison to other drift detectors proposed in the data mining literature [29].
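To make the detector concrete, the sketch below illustrates the core ADWIN idea in Python: maintain a window of recent observations (for example, per-instance 0/1 prediction correctness), compare the means of every pair of adjacent sub-windows, and drop the older portion whenever the difference exceeds a confidence bound. This is a simplified, illustrative version of the mechanism; the actual ADWIN of [29] compresses the window into exponential buckets, and the class name and cut threshold used here are our own.

import math
from collections import deque

class SimpleAdwin:
    """Simplified adaptive-window drift detector (illustrative only)."""

    def __init__(self, delta=0.002):
        self.delta = delta      # detection confidence is (1 - delta)
        self.window = deque()   # recent observations, oldest on the left

    def _cut_threshold(self, n0, n1):
        # Hoeffding-style bound on the allowed difference of two means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def _find_cut(self):
        # Return a split point whose adjacent sub-windows differ, if any.
        n, total, head = len(self.window), sum(self.window), 0.0
        for i, v in enumerate(self.window, start=1):
            head += v
            if i == n:
                break
            mean_old, mean_new = head / i, (total - head) / (n - i)
            if abs(mean_old - mean_new) > self._cut_threshold(i, n - i):
                return i
        return None

    def update(self, value):
        """Add one observation (e.g. 1 if the build prediction was correct,
        0 otherwise); return True if a concept drift point was detected."""
        self.window.append(value)
        drift = False
        while len(self.window) > 1:
            cut = self._find_cut()
            if cut is None:
                break
            for _ in range(cut):
                self.window.popleft()   # discard the stale sub-window
            drift = True
        return drift

Feeding per-instance correctness values from the classifier into update() yields drift points that can be used to trigger adaptation of the model.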
C. Synthetic Data Generation

Many of the challenges associated with data stream mining are related to dealing with high volumes of data in relatively short timescales. It therefore seems counter-intuitive to deploy synthetic data generation techniques in conjunction with a data stream mining approach. However, the discarding of data from Jazz does not encourage the use of static classification approaches in practice, even though such approaches can be deployed on any given snapshot of the repository [22]. Deploying a data stream method in conjunction with synthetic data generation allows a consistent approach to be used in practice for new projects. Synthetic data can be generated from the limited quantity of actual data that is available in the early stages of development, and such synthetic data can be gradually phased out as larger volumes of real data become available. Such consistency is important if data mining approaches are to become useful to software practitioners.

Synthetic data generation has been a research area for some time, with the literature containing many examples of random or pseudo-random data generation [25]. However, the goal of our research is such that synthetic data must be representative of the real data, and therefore a more refined generation approach is required. Such approaches include DataBoost-IM [26], ADASYN [27] and SMOTE [2], to name but a few. Many of these approaches are based on similar sampling algorithms, and in this work we have elected to apply the standard SMOTE algorithm as it has been effectively applied in many domains.
III. THE JAZZ DATASET

IBM Jazz is a fully integrated software development tool that automatically captures software development processes and artifacts. The Jazz repository contains real-time evidence that allows researchers to gain insights into team collaboration and development activities within software engineering projects [1, 7]. The Jazz repository artifacts include work items, build items, change sets, source code files, authors and comments. A work item is a description of a unit of work, which is categorized as a task, enhancement or defect. A build item is software compiled to form a working unit. A change set is a collection of code changes in a number of files. In Jazz a change set is created by one author only and relates to one work item. A single work item may contain many change sets. Source code files are included in change sets and over time can be related to multiple change sets.
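The relationships among these artifacts can be summarized in a small, purely illustrative data model; the class and field names below are ours (not Jazz API types) and simply encode the cardinalities just described.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceCodeFile:
    path: str                       # a file may appear in many change sets

@dataclass
class ChangeSet:
    author: str                     # exactly one author per change set
    files: List[SourceCodeFile]     # code changes across a number of files

@dataclass
class WorkItem:
    category: str                   # "task", "enhancement" or "defect"
    description: str
    change_sets: List[ChangeSet] = field(default_factory=list)  # one-to-many

@dataclass
class BuildItem:
    outcome: str                    # "success" or "failure"
    change_sets: List[ChangeSet] = field(default_factory=list)  # changes compiled into the build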
One of the challenges associated with working with the Jazz repository is that the data contains holes and misleading elements which cannot be removed or identified easily. This is because the Jazz environment has been used within its own development; therefore many features provided by Jazz were not implemented at early stages of the project. This sparseness of the data has driven the decision to focus on using software metrics as the predictor attributes. Whilst features of the Jazz environment may not have been present during early phases of development, there has always been source code, and therefore a consistent set of data can be created.

IV. THEORETICAL FOUNDATIONS

Software metrics have been generated in order to deal with the sparseness of the data. Metric values can be derived by extracting development code from software repositories. Such metrics are commonly used within model-based project management methods.
Software metrics are used to measure the complexity, quality and effort of a software development project [8-12]. In the Jazz repository each software build contains change sets that indicate the actual source code files that were modified during the implementation of the build. Source code metrics for each file are computed using the IBM Software Analyzer tool. The build's after state was utilized in order to ensure that the source code snapshot represented the actual software artifact that either failed or succeeded.

The Jazz repository consists of various types of software builds. Included in this study were continuous builds (regular user builds), nightly builds (incorporating changes from the local site) and integration builds (integrating components from remote sites). As a result, the following basic, average basic, dependency, complexity, cohesion and Halstead software metrics were derived from the source code files for each build:

Basic Software Metrics:
Number of Types Per Package, Number of Comments, Lines of Code, Comment/Code Ratio, Number of Import Statements, Number of Interfaces, Number of Methods, Number of Parameters, Number of Lines, Average Number of Attributes Per Class, Average Number of Constructors Per Class, Average Number of Comments, Average Lines of Code Per Method, Average Number of Methods, Average Number of Parameters.

Dependency Metrics:
Abstractness, Afferent Coupling, Efferent Coupling, Maintainability Index, Instability, Normalized Distance.

Complexity Metrics:
Average Block Depth, Average Cyclomatic Complexity.

Cohesion Metrics:
Lack of Cohesion 1 (LCOM1), Lack of Cohesion 2 (LCOM2), Lack of Cohesion 3 (LCOM3).

Halstead Metrics:
Number of Operands, Number of Operators, Number of Unique Operands, Number of Unique Operators, Program Volume, Difficulty Level, Effort to Implement, Number of Delivered Bugs, Time to Implement, Program Length, Program Level, Program Vocabulary Size.
A. Synthetic Minority Over-sampling TEchnique (SMOTE)

When working with real-world data it is often found that data sets are heavily comprised of "normal" instances, with only a small percentage representing interesting findings. As a result, the "abnormal" instances have a negative impact on a model's performance, as they have a greater probability of misclassification using data mining methods [2, 13]. Data instances that introduce noise within the data are often found within the minority class [14, 15]. In order to overcome this limitation, synthetically under-sampling the majority class may improve a classifier's performance. However, in doing so valuable data may be lost and model over-fitting may occur, resulting in majority instances being wrongly classified as minority instances when new, unseen data is presented to the classifier model that was induced [14]. Another solution is to provide the classifier with more complete regions within the feature space via the creation of new instances that are synthesized from existing data instances.

SMOTE enables a data miner to over-sample the minority class to achieve potentially better classifier performance without loss of data [2, 13]. While other over-sampling methods exist, such as Ripper's Loss Ratio and Naive Bayes methods, SMOTE provides better levels of performance as it generates more minority class samples for a classifier to learn from, thereby allowing broader decision regions and coverage [13]. SMOTE has been utilized within the software research community and compared with other sampling techniques in software quality modeling (random under-sampling, random oversampling, cluster-based oversampling and Borderline-SMOTE), yielding encouraging results [5, 8]. SMOTE has also been applied as a sampling strategy for software defect prediction using data sets from NASA software projects [10, 16-18], and for fault-prone module detection using the MIS telecommunication systems data [24]. For this work SMOTE is applied as a supervised instance filter using the Weka [19] machine learning workbench.

In order to avoid the over-fitting problem while expanding minority class regions, SMOTE generates new instances by operating within the existing feature space. New instance values are derived from interpolation rather than extrapolation, so they still carry relevance to the underlying data set. For each minority class instance, SMOTE interpolates values using a k-nearest neighbor technique and creates attribute values for new data instances [8]. For each minority instance I, a new synthetic data instance is generated by taking the difference between the feature vector of I and that of a nearest neighbor J belonging to the same class, multiplying it by a random number between 0 and 1, and then adding it to I. This creates a random point along the line segment between instances I and J, resulting in the creation of a new instance within the data set [13]. This process is repeated for the other k-1 neighbors of the minority instance I. As a result SMOTE generates more general regions from the minority class, and decision tree classifiers are able to use the data set for better generalizations.
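As a concrete illustration of this interpolation step, the sketch below re-implements it with NumPy. It is an expository stand-in, not the Weka filter used in this work; the function name, the fixed k and the uniform sampling of neighbors are our own choices.

import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic rows from an (m, d) minority-class array.

    For each synthetic instance: pick a minority instance I, pick one of
    its k nearest same-class neighbors J, and interpolate
    I + u * (J - I) with u drawn uniformly from [0, 1).
    """
    if rng is None:
        rng = np.random.default_rng()
    m = len(minority)
    synthetic = np.empty((n_new, minority.shape[1]))
    for s in range(n_new):
        i = rng.integers(m)
        # Distances from I to every other minority instance.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        dists[i] = np.inf                      # exclude I itself
        neighbors = np.argsort(dists)[:k]      # k nearest same-class rows
        j = rng.choice(neighbors)
        u = rng.random()                       # interpolation factor in [0, 1)
        synthetic[s] = minority[i] + u * (minority[j] - minority[i])
    return synthetic

Because every synthetic point lies on a segment between two existing minority instances, the generated data stays inside the region already occupied by the minority class, which is the interpolation-not-extrapolation property described above.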
B. Hoeffding Tree

The Hoeffding tree is an incremental decision tree induction method. Using the Hoeffding bound, it ascertains the number of instances that are needed to split a given (decision) node of a tree and operates within a certain precision that can be predetermined [3]. This method has potential in terms of predicting future outcomes of software builds with high accuracy while working with real-world data. Rather than using training and test sets, instances are presented as streams. The Hoeffding tree is commonly used for classifying high-speed data streams. The algorithm that it uses generates a decision tree from data incrementally by inspecting each instance within a stream, without the need to store instances for later retrieval. The tree resides in memory during each iteration and stores information in its branches and leaves, potentially growing from "learning" every new instance. The decision tree itself can be inspected at any time during the streaming process. The quality of the tree is comparable to that produced by traditional mining techniques, even though instances are introduced in an incremental manner.

Just as with traditional decision tree learners, the Hoeffding tree is easy to interpret, making it easier to understand how the model works. In addition to this, decision tree learners have proven to provide accurate solutions to a wide range of problems that are based on multi-dimensional data. For Hoeffding trees each node of a decision tree undergoes a test which may result in it being split into two or more child nodes, with each instance sent down the relevant branch to its destination child node depending on the values of its attributes. The split test is implemented through the use of the Hoeffding bound, which is expressed as:

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}} \qquad (1)$$
The Hoeffding bound expressed in (1) above states that, with confidence $1-\delta$, the population mean of a random variable with range $R$ lies in the interval $[\bar{x}-\epsilon, \bar{x}+\epsilon]$, where $\bar{x}$ is the sample (observable) mean of the random variable. In the context of decision tree induction the random variable is information gain. The information gain function ranges in value from $0$ to $\log_2 c$, where $c$ is the number of classes; since $c=2$ in the mining problem that we undertake (only the outcomes success and failure are possible), the range $R$ reduces to 1. The variable $n$ refers to the number of data instances seen up to the point that the test is carried out. The bound holds true irrespective of the underlying data distribution generating the values, and depends only on the range of values, the number of observations made and a split confidence level. The Hoeffding tree uses the Hoeffding bound to determine whether an existing (leaf) node should be split as follows. Suppose that after $n$ data instances have arrived, the difference in information gain between the two highest ranking attributes $X_a$ and $X_b$ is $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b)$ (i.e. $X_a$ is the attribute with the highest information gain); then, with confidence $1-\delta$, the Hoeffding bound guarantees that the correct choice is to split the given leaf node on attribute $X_a$ if $\Delta\bar{G} > \epsilon$. A tie threshold parameter $\tau$ handles the case where the two best attributes remain nearly indistinguishable: once $\epsilon$ falls below $\tau$, a split is made on the current best attribute.
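As a worked example under assumed values (the specific numbers are ours, chosen only for illustration): with range R = 1 for the two-class problem, a split confidence of δ = 10⁻⁷ (a common default in Hoeffding tree implementations) and n = 200 instances observed at a leaf, equation (1) gives ε = √(ln 10⁷ / 400) ≈ 0.20, so the leaf splits on X_a only if its information gain leads the runner-up by more than 0.20. The snippet below computes ε and applies the split rule, including the tie threshold.

import math

def hoeffding_bound(r, delta, n):
    """Equation (1): epsilon for range r, confidence 1 - delta, n samples."""
    return math.sqrt((r * r * math.log(1.0 / delta)) / (2.0 * n))

def should_split(gain_best, gain_second, n, r=1.0, delta=1e-7, tau=0.05):
    """Hoeffding tree split test: split when the observed gain difference
    exceeds epsilon, or when epsilon shrinks below the tie threshold tau."""
    eps = hoeffding_bound(r, delta, n)
    return (gain_best - gain_second) > eps or eps < tau

# With r=1, delta=1e-7 and n=200: eps ~= 0.2007, so the best attribute is
# chosen only if it leads by more than ~0.20 bits of information gain.
print(hoeffding_bound(1.0, 1e-7, 200))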
In this research we use the Hoeffding tree implementation from MOA [13], a real-time analytics tool for mining data streams.
V. EXPERIMENTAL STUDY

The original software metric data set consists of 199 Jazz build instances: 127 successful builds and 72 failed builds. Build instances are sorted by date to ensure accurate simulation of a development team working over time. SMOTE is then applied twice at 900%, increasing the number of instances to 1,990 (1,270 successful builds and 720 failed builds). The first application increases the number of minority class instances (failed builds), and the second application increases the class that has temporarily become the new minority (successful builds). The instances are then encoded into data streams which are utilized by the Hoeffding tree for the data mining process. Three parameters were set for the tree induction. The Hoeffding tree uses a grace period parameter which stipulates the frequency with which checks for leaf node splits are carried out; the greater the value, the higher the efficiency of the process. We use a setting of 200 for the grace period parameter. The tie threshold parameter, which controls the degree of splitting, was set to 0.05.
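A minimal sketch of this pipeline is given below, assuming the smote() helper from Section IV.A and an incremental classifier object exposing predict/learn methods (a stand-in for MOA's Hoeffding tree configured with a grace period of 200 and a tie threshold of 0.05). It is an illustrative reconstruction, not the actual Weka/MOA tooling used in the study.

import numpy as np

def oversample_both_classes(X, y, rng):
    """Apply SMOTE twice at 900%, as in the experiment: first to the
    failed builds (the original minority), then to the successful builds
    (the temporary minority after the first pass)."""
    out_X, out_y = [X], [y]
    for label in (0, 1):                    # assumed coding: 0 = failed, 1 = successful
        cls = X[y == label]
        new = smote(cls, n_new=9 * len(cls), rng=rng)   # +900% per class
        out_X.append(new)
        out_y.append(np.full(len(new), label))
    return np.vstack(out_X), np.concatenate(out_y)

def prequential(model, X, y):
    """Test-then-train over the date-ordered stream; returns running accuracy."""
    correct, acc = 0, []
    for i, (xi, yi) in enumerate(zip(X, y), start=1):
        correct += int(model.predict(xi) == yi)   # test on the instance first...
        model.learn(xi, yi)                       # ...then train on it
        acc.append(correct / i)
    return acc

Because SMOTE only interpolates between same-class instances, applying it to each class in turn is equivalent to the two sequential passes described above, and the prequential loop reproduces the accuracy-over-instances curves reported in the figures.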
Presented in Figure 1 is the classification accuracy obtained with the use of after state metrics for builds. The classification accuracy at the start of the time series was 65.2%, and at the end of the stream the accuracy grew to 80.25%. The average overall accuracy over the entire time series was 70%. This indicates that there is potential for the accuracy of prediction to improve as more real data emerges.

[Figure 1. Hoeffding Tree Overall Classification Accuracy: overall accuracy (%) versus the number of trained instances.]

The initial instability in classification accuracy is an interesting phenomenon, given that the initial grace period of 200 builds is intended to provide stability in the emerging model. Upon examination of the synthetic data it can be observed that the data maintains comparable instances of each class up until 900 builds. After 900 builds, the data contains an increasing proportion of successful builds. This is shown in Figure 2.

[Figure 2. Build Distribution over Time: cumulative number of failed and successful builds versus the number of trained instances.]

Figure 3 presents the classification accuracies of successful builds. It is observed that the general trend for classifying success initially declines to reach a minimum at approximately 900 instances, after which there is a gradual improvement that appears to be trending towards a stable value of around 80%. Figure 4 displays the sensitivity ratings for successful builds over time. For successful builds the accuracy at the beginning of the data stream time series was 66.38% and ended with 79.1% (with an average of 64%).

[Figure 3. Hoeffding Tree Classification Accuracy for Successful Builds: accuracy (%) versus the number of trained instances.]
[Figure 4. Sensitivity ratings (truePositive and falsePositive rates) for successful builds versus the number of trained instances.]

Other work [30] suggests that failed builds are harder to classify than successful builds. This work suggests that failed builds may be harder to classify when there is a significantly larger number of successful builds that dominate the classification model.

Figure 7 illustrates the final decision tree produced by the Hoeffding tree stream mining technique on the extended RSA after state software metrics data set. In this case the tree is larger than the previous software metric based Hoeffding tree, with a depth of 7. Upon inspecting the tree, common sense classifications are being made; for example, a higher number of interfaces tends to be associated with failure. This is intuitive

[Figure 7. Final Hoeffding Tree for After State Software Metrics.]