On the need of preserving order of data when validating within-project defect classifiers

D Falessi, J Huang, L Narayana, JF Thai, B Turhan
arXiv preprint arXiv:1809.01510, 2018
[Context]
Defect prediction models, such as classifiers, can support the allocation of testing resources by using data from previous releases of the same project to predict which software components are likely to be defective. A validation technique, hereinafter "technique", defines a specific way to split the available data into training and test sets in order to measure a classifier's accuracy. Time-series techniques have the unique ability to preserve the temporal order of data; i.e., they prevent the test set from containing data that predate the training set.
[Aim]
The aim of this paper is twofold. First, we check whether there is a difference in classifier accuracy as measured by time-series versus non-time-series techniques. Second, we check a possible reason for this difference, i.e., whether defect rates change across the releases of a project.
[Method]
Our method consists of measuring the accuracy, i.e., the AUC, of 10 classifiers on 13 open and two closed projects by using three validation techniques, namely 10-fold cross-validation, bootstrap, and walk-forward, of which only walk-forward is a time-series technique.
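To make the contrast between the techniques concrete, the following is a minimal sketch, not the authors' implementation. It assumes releases are indexed from oldest (0) to newest, that walk-forward trains on all releases preceding the tested one, and that k-fold assigns items to folds without regard to time. The function names `walk_forward_splits`, `k_fold_splits`, and `preserves_order` are hypothetical.

```python
def walk_forward_splits(n_releases):
    """Yield (train, test) lists of release indices, oldest release first.

    Assumption: each release is tested once, using all earlier releases
    as training data (an expanding window).
    """
    for test_idx in range(1, n_releases):
        yield list(range(test_idx)), [test_idx]


def k_fold_splits(n_items, k):
    """Yield (train, test) index lists for a simple interleaved k-fold split."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


def preserves_order(train, test):
    """True when every training index precedes every test index."""
    return max(train) < min(test)


wf = list(walk_forward_splits(4))
kf = list(k_fold_splits(4, 2))

print(all(preserves_order(tr, te) for tr, te in wf))  # walk-forward splits
print(any(preserves_order(tr, te) for tr, te in kf))  # k-fold splits
```

In this toy setting every walk-forward split keeps the training data older than the test data, whereas the k-fold splits mix older and newer indices on both sides.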
[Results]
We find that, for the same classifier on the same project, the AUC measured by 10-fold cross-validation differs from the AUC measured by walk-forward by an amount in the range [-0.20, 0.22], and the difference is statistically significant in 45% of the cases. Similarly, the AUC measured by bootstrap differs from the AUC measured by walk-forward by an amount in the range [-0.17, 0.43], and the difference is statistically significant in 56% of the cases.
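For readers less familiar with the metric being compared, the sketch below computes AUC from its pairwise definition: the probability that a randomly chosen defective component receives a higher score than a randomly chosen non-defective one, with ties counted as one half. It is an illustration only; the function name `auc` and the labels and scores are made up and are not data from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for defective, 0 for non-defective.
    scores: classifier scores, higher meaning more likely defective.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one item of each class")
    # Count defective/non-defective pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(example_auc)
```

Here three of the four defective/non-defective pairs are ranked correctly, so the value is 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.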
[Conclusions]
We recommend choosing the validation technique by carefully considering the conclusions to be drawn, the properties of the available datasets, and the level of realism with respect to the classifier usage scenario.