EMPIRICAL RESEARCH IN SOFTWARE ENGINEERING
CONCEPTS, ANALYSIS, AND APPLICATIONS
Ruchika Malhotra

The book balances empirical research concepts with exercises, examples, and real-life case studies. The author discusses the process of developing predictive models, such as defect prediction and change prediction, on data collected from source code repositories. She also covers the application of machine learning techniques in empirical software engineering, includes guidelines for publishing and reporting results, and presents popular software tools for carrying out empirical studies.

Ruchika Malhotra is an assistant professor in the Department of Software Engineering at Delhi Technological University (formerly Delhi College of Engineering). She was awarded the prestigious UGC Raman Fellowship for pursuing postdoctoral research in the Department of Computer and Information Science at Indiana University–Purdue University. She earned her master’s and doctorate degrees in software engineering from the University School of Information Technology of Guru Gobind Singh Indraprastha University. She received the IBM Best Faculty Award in 2013 and has published more than 100 research papers in international journals and conferences. Her research interests include software testing, improving software quality, statistical and adaptive prediction models, software metrics, neural nets modeling, and the definition and validation of software metrics.

[Cover diagram: the empirical study cycle: Study Definition, Experimental Design, Mining Data from Repositories, Data Analysis & Statistical Testing, Model Development & Interpretation, Validating Threats, and Reporting Results.]
ISBN: 978-1-4987-1972-8
EMPIRICAL
RESEARCH in
SOFTWARE
ENGINEERING
   CONCEPTS, ANALYSIS,
   AND APPLICATIONS
Ruchika Malhotra
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
 I dedicate this book to my grandmother, the late Shrimathi Shakuntala Rani Malhotra,
for her infinite love, understanding, and support. Without her none of my success would
     have been possible and I would not be who I am today. I miss her very much.
Contents
  1. Introduction .............................................................................................................................1
     1.1 What Is Empirical Software Engineering? ................................................................ 1
     1.2 Overview of Empirical Studies ...................................................................................2
     1.3 Types of Empirical Studies ..........................................................................................3
           1.3.1 Experiment........................................................................................................4
           1.3.2 Case Study ........................................................................................................ 5
           1.3.3 Survey Research ...............................................................................................6
           1.3.4 Systematic Reviews.......................................................................................... 7
           1.3.5 Postmortem Analysis ...................................................................................... 8
     1.4 Empirical Study Process .............................................................................................. 8
           1.4.1 Study Definition ...............................................................................................9
           1.4.2 Experiment Design ........................................................................................ 10
           1.4.3 Research Conduct and Analysis .................................................................. 11
           1.4.4 Results Interpretation .................................................................................... 12
           1.4.5 Reporting ........................................................................................................ 12
           1.4.6 Characteristics of a Good Empirical Study ................................................ 12
     1.5 Ethics of Empirical Research ..................................................................................... 13
            1.5.1 Informed Consent .......................................................................... 14
           1.5.2 Scientific Value ............................................................................................... 15
           1.5.3 Confidentiality ............................................................................................... 15
           1.5.4 Beneficence...................................................................................................... 15
           1.5.5 Ethics and Open Source Software ............................................................... 15
           1.5.6 Concluding Remarks ..................................................................................... 15
     1.6 Importance of Empirical Research ........................................................................... 16
           1.6.1 Software Industry .......................................................................................... 16
           1.6.2 Academicians ................................................................................................. 16
           1.6.3 Researchers ..................................................................................................... 17
     1.7 Basic Elements of Empirical Research ..................................................................... 17
     1.8 Some Terminologies ................................................................................................... 18
           1.8.1 Software Quality and Software Evolution ................................................. 18
           1.8.2 Software Quality Attributes ......................................................................... 20
           1.8.3 Measures, Measurements, and Metrics ...................................................... 20
           1.8.4 Descriptive, Correlational, and Cause–Effect Research ...........................22
           1.8.5 Classification and Prediction .......................................................................22
           1.8.6 Quantitative and Qualitative Data ..............................................................22
           1.8.7 Independent, Dependent, and Confounding Variables ........................... 23
           1.8.8 Proprietary, Open Source, and University Software ................................ 24
           1.8.9 Within-Company and Cross-Company Analysis ..................................... 25
 3. Software Metrics...................................................................................................................65
    3.1 Introduction .................................................................................................................65
         3.1.1   What Are Software Metrics? ...................................................................... 66
         3.1.2   Application Areas of Metrics ..................................................................... 66
         3.1.3   Characteristics of Software Metrics .......................................................... 67
    3.2 Measurement Basics ................................................................................................... 67
         3.2.1   Product and Process Metrics ..................................................................... 68
         3.2.2   Measurement Scale...................................................................................... 69
    3.3 Measuring Size ............................................................................................................ 71
    3.4 Measuring Software Quality..................................................................................... 72
         3.4.1   Software Quality Metrics Based on Defects ............................................ 72
                 3.4.1.1 Defect Density .............................................................................. 72
                 3.4.1.2 Phase-Based Defect Density ....................................................... 73
                 3.4.1.3 Defect Removal Effectiveness .................................................... 73
Foreword

Mark Harman
University College London

Preface
  • Describes software metrics; the most popular metrics given to date are included
    and explained with the help of examples.
  • Provides an in-depth description of experimental design, including research question
    formulation, literature review, variables description, hypothesis formulation,
    data collection, and selection of data analysis methods.
  • Provides a full chapter on mining data from software repositories. It presents the
    procedure for extracting data from software repositories such as CVS, SVN, and
    Git and provides applications of the data extracted from these repositories in the
    software engineering area.
  • Describes methods for analyzing data, hypothesis testing, model prediction, and
    interpreting results. It presents statistical tests with examples to demonstrate their
    use in the software engineering area.
  • Describes performance measures and model validation techniques. Guidelines
    for using the statistical tests and performance measures are also provided.
    It also emphasizes the use of machine-learning techniques for developing prediction
    models, along with the issues involved with these techniques.
  • Summarizes the categories of threats to validity with practical examples. A sum-
    mary of threats extracted from fault prediction studies is presented.
  • Provides guidelines to researchers and doctorate students for publishing and
    reporting the results. Research misconduct is discussed.
  • Presents the procedure for mining unstructured data using text mining tech-
    niques and describes the concepts with the help of examples and case studies. It sig-
    nifies the importance of text-mining procedures in extracting relevant information
    from software repositories and presents the steps in text mining.
  • Presents real-life research-based case studies on software quality prediction mod-
    els. The case studies are developed to demonstrate the procedures used in the
    chapters of the book.
  • Presents an overview of tools that are widely used in the software industry for
    carrying out empirical studies.
I take immense pleasure in presenting to you this book and hope that it will inspire
researchers worldwide to utilize the knowledge and know-how contained herein for new
applications and for advancing the frontiers of software engineering. Feedback from readers
is important for continuously improving the contents of the book. I welcome constructive
criticism and comments about anything in the book; any omission is due to my oversight.
I will gratefully receive suggestions, which will motivate me to work hard on the next,
improved edition of the book; as Robert Frost rightly wrote:
                                                                         Ruchika Malhotra
                                                               Delhi Technological University
Acknowledgments
I am most thankful to my father for constantly encouraging me and giving me time and
unconditional support while writing this book. He has been a major source of inspiration
to me.
   This book is a result of the motivation of Yogesh Singh, Director, Netaji Subhas Institute
of Technology, Delhi, India. It was his idea that triggered this book. He has been a constant
source of inspiration for me throughout my research and teaching career. He has been con-
tinuously involved in evaluating and improving the chapters in this book; I will always be
indebted to him for his extensive support and guidance.
   I am extremely grateful to Mark Harman, professor of software engineering and head
of the Software Systems Engineering Group, University College London, UK, for the sup-
port he has given to the book in his foreword. He has contributed greatly to the field of
software engineering research and was the first to explore the use and relevance of search-
based techniques in software engineering.
   I am extremely grateful to Megha Khanna, assistant professor, Acharya Narendra Dev
College, University of Delhi, India, for constantly working with me in terms of solving
examples, preparing case studies, and reading texts. The book would not have been in its
current form without her support. My sincere thanks to her.
   My heartfelt gratitude is due to Ankita Jain Bansal, assistant professor, Netaji Subhas
Institute of Technology, Delhi, India, who worked closely with me during the evolution
of the book. She was continuously involved in modifying various portions in the book,
especially experimental design procedure and threshold analysis.
   I am grateful to Abha Jain, research scholar, Delhi Technological University, Delhi,
India, for helping me develop performance measures and text-mining examples for the
book. My thanks are also due to Kanishk Nagpal, software engineer, Samsung India
Electronics Limited, Delhi, India, who worked closely with me on mining repositories and
who developed the DCRS tool provided in Chapter 5.
   Thanks are due to all my doctoral students in the Department of Software Engineering,
Delhi Technological University, Delhi, India, for motivating me to explore and evolve
empirical research concepts and applications in software engineering. Thanks are also due
to my undergraduate and postgraduate students at the Department of Computer Science
and Software Engineering, Delhi Technological University, for motivating me to study
more before delivering lectures and exploring and developing various tools in several
projects. Their outlook, debates, and interest have been my main motivation for continu-
ous advancement in my academic pursuit. I also thank all researchers, scientists, practitio-
ners, software developers, and teachers whose insights, opinions, ideas, and techniques
find a place in this book.
   Thanks also to Rajeev Raje, professor, Department of Computer & Information Science,
Indiana University–Purdue University Indianapolis, Indianapolis, for his support and
valuable suggestions.
   Last, I am thankful to Manju Khari, assistant professor, Ambedkar Institute of
Technology, Delhi, India, for her support in gathering some of the material for the further
readings sections in the book.
1
Introduction
As the size and complexity of software are increasing, software organizations are facing
the pressure of delivering high-quality software within a specific time, budget, and avail-
able resources. The software development life cycle consists of a series of phases, includ-
ing requirements analysis, design, implementation, testing, integration, and maintenance.
Software professionals want to know which tools to use at each phase in software devel-
opment and desire effective allocation of available resources. The software planning team
attempts to estimate the cost and duration of software development, the software testers
want to identify the fault-prone modules, and the software managers seek to know which
tools and techniques can be used to reduce the delivery time and best utilize the man-
power. In addition, the software managers also desire to improve the software processes
so that the quality of the software can be enhanced. Traditionally, software engineers
have made decisions based on their intuition or individual expertise, without any
scientific evidence of the benefits of a tool or a technique.
   Empirical studies are verified by observation or experiment and can provide powerful
evidence for testing a given hypothesis (Aggarwal et al. 2009). Like other disciplines, soft-
ware engineering has to adopt empirical methods that will help to plan, evaluate, assess,
monitor, control, predict, manage, and improve the way in which software products are
produced. An empirical study of real systems can help software organizations assess
large software systems quickly, at low costs. The application of empirical techniques is
especially beneficial for large-scale systems, where software professionals need to focus
their attention and resources on various activities of the system under development.
For example, developing a model for predicting faulty modules allows software organiza-
tions to identify faulty portions of source code so that testing activities can be planned
more effectively. Empirical studies, such as surveys, systematic reviews, and experiments,
help software practitioners scientifically assess and validate the tools and techniques
used in software development.
   In this chapter, an overview and the types of empirical studies are provided, the phases
of the experimental process are described, and the ethics involved in empirical research
of software engineering are summarized. Further, this chapter also discusses the key con-
cepts used in the book.
within a specified time and budget. Fritz Bauer coined the term software engineering in 1968 at
the first conference on software engineering and defined it as (Naur and Randell 1969):
        The establishment and use of sound engineering principles in order to obtain economically
        developed software that is reliable and works efficiently on real machines.
The software engineering discipline facilitates the completion of the objective of delivering
good quality software to the customer following a systematic and scientific approach.
Empirical methods can be used in software engineering to provide scientific evidence on
the use of tools and techniques.
  Harman et al. (2012a) defined “empirical” as:
        “Empirical” is typically used to define any statement about the world that is related to
        observation or experience.
Empirical software engineering (ESE) is an area of research that emphasizes the use of empir-
ical methods in the field of software engineering. It involves methods for evaluating, assess-
ing, predicting, monitoring, and controlling the existing artifacts of software development.
  ESE applies quantitative methods to the software engineering phenomenon to understand
software development better. ESE has been gaining importance over the past few decades
because of the availability of vast data sets from open source repositories that contain
information about software requirements, bugs, and changes (Meyer et al. 2013).
Empirical studies are important in the area of software engineering as they allow software
professionals to evaluate and assess new concepts, technologies, tools, and techniques in a
scientific and proven manner. They also support improving, managing, and controlling
existing processes and techniques by using evidence obtained from empirical analysis.
The empirical information can help software management in decision making and improving
software processes. The empirical studies involve the steps shown in Figure 1.1.

FIGURE 1.1
Steps in empirical studies: research questions, hypothesis formation, data collection, data analysis, model development and validation, and concluding results.

   An empirical study allows researchers to gather evidence that can be used to support
claims about the efficiency of a given technique or technology. Thus, empirical studies
help in building a body of knowledge so that processes and products are improved,
resulting in high-quality software.
  Empirical studies are of many types, including surveys, systematic reviews, experi-
ments, and case studies.
   In qualitative research, the researchers study human behavior, preferences, and nature.
Qualitative research provides an in-depth analysis of the concept under investigation
and thus uses focused data for research. Understanding a new process or technique in
software engineering is an example of qualitative research. Qualitative research provides
textual descriptions or pictures related to human beliefs or behavior. It can be extended
to other studies with similar populations but generalizations of a particular phenomenon
may be difficult. Qualitative research involves methods such as observations, interviews,
and group discussions. This method is widely used in case studies.
   Qualitative research can be used to analyze and interpret the meaning of results produced
by quantitative research. Quantitative research generates numerical data for analysis,
whereas qualitative research generates non-numerical data (Creswell 1994). The data of
qualitative research is quite rich as compared to quantitative data. Table 1.1 summarizes
the key differences between quantitative and qualitative research.
   The empirical studies can be further categorized as experimental, case study, systematic
review, survey, and post-mortem analysis. These categories are explained in the next sec-
tion. Figure 1.2 presents the quantitative and qualitative types of empirical studies.
1.3.1 Experiment
An experimental study tests the established hypothesis by finding the effect of variables of
interest (independent variables) on the outcome variable (dependent variable) using statis-
tical analysis. If the experiment is carried out correctly, the hypothesis is either accepted or
rejected. For example, if one group uses technique A and another group uses technique B,
which technique is more effective in detecting a larger number of defects? The researcher
may apply statistical tests to answer such questions. According to Kitchenham et al. (1995),
experiments are small in scale and must be controlled. The experiment must also control
the confounding variables, which may affect the accuracy of the results produced by
the experiment. Experiments are carried out in a controlled environment and are often
referred to as controlled experiments (Wohlin 2012).
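To make the idea concrete, the following minimal sketch (in Python) applies a nonparametric statistical test to hypothetical defect counts collected from two independent groups, one using technique A and one using technique B. The data values, the choice of the Mann–Whitney U test, and the significance level are illustrative assumptions rather than prescriptions from this chapter; the statistical tests themselves are discussed in later chapters.

```python
# Minimal sketch: comparing defects detected by two independent groups.
# The defect counts below are hypothetical; a real experiment would collect
# them under the controlled conditions described above.
from scipy import stats

defects_technique_a = [12, 15, 9, 14, 11, 13, 10, 16]  # group using technique A
defects_technique_b = [8, 10, 7, 11, 9, 8, 12, 9]      # group using technique B

# Nonparametric test for a difference between the two independent samples.
statistic, p_value = stats.mannwhitneyu(
    defects_technique_a, defects_technique_b, alternative="two-sided"
)

alpha = 0.05  # assumed significance level
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the techniques differ in defects detected.")
else:
    print("Fail to reject the null hypothesis at the chosen significance level.")
```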
   The key factors involved in the experiments are independent variables, dependent
variables, hypothesis, and statistical techniques.

TABLE 1.1
Comparison of Quantitative and Qualitative Research (columns: Quantitative Research, Qualitative Research).

FIGURE 1.2
Types of empirical studies, grouped into quantitative and qualitative categories: experiment, survey research, systematic reviews, postmortem analysis, and case study.

FIGURE 1.3
Steps in experimental research: experiment definition, experiment design, experiment conduct and analysis, experiment interpretation, and experiment reporting.
The basic steps followed in experimental research are shown in Figure 1.3. The same steps
are followed in any empirical study process; however, the content varies according to the
specific study being carried out. In the first phase, the experiment is defined. The next
phase involves determining the experiment design. In the third phase, the experiment is
executed as per the experiment design. Then, the results are interpreted. Finally, the
results are presented in the form of an experiment report. To carry out an empirical study,
a replicated study (repeating a study with similar settings or methods but different data
sets or subjects), or a survey of existing empirical studies, the research methodology
followed in these studies needs to be formulated and described.
   A controlled experiment involves varying one or more variables while keeping everything
else constant, and is usually conducted in a small or laboratory setting (Conradi and
Wang 2003). Comparing two methods for defect detection is an example of a controlled
experiment in the software engineering context.
method, or process (Kitchenham et  al. 1995). The effect of a change in an organization
can be studied using case study research. Case studies increase the understanding of the
phenomenon under study. For example, a case study can be used to examine whether a
unified modeling language (UML) tool is effective for a given project or not. Initial and
new concepts are analyzed and explored by exploratory case studies, whereas already
existing concepts are tested and improved by confirmatory case studies.
  The phases included in the case study are presented in Figure  1.4. The case study
design phase involves identifying existing objectives, cases, research questions, and
data-collection strategies. The case may be a tool, technology, technique, process, product,
individual, or software. Qualitative data is usually collected in a case study. The sources
include interviews, group discussions, or observations. The data may be directly or indi-
rectly collected from participants. Finally, the case study is executed, the results obtained
are analyzed, and the findings are reported. The report type may vary according to the
target audience.
  Case studies are appropriate where a phenomenon is to be studied for a longer period
of time so that the effects of the phenomenon can be observed. The disadvantages of case
studies include difficulty in generalization as they represent a typical situation. Since they
are based on a particular case, the validity of the results is questionable.
FIGURE 1.4
Case study phases.
outcome variable, a researcher may want to explain why an independent variable affects
the outcome variable.
The purpose of a systematic review is to summarize the existing research and provide
future guidelines for research by identifying gaps in the existing literature. A systematic
review involves:
The systematic reviews are performed in three phases: planning the review, conducting
the review, and reporting the results of the review. Figure 1.5 presents the summary of the
phases involved in systematic reviews.
  In the planning stage, the review protocol is developed that includes the following
steps: research questions identification, development of review protocol, and evaluation
of review protocol. During the development of review protocol the basic processes in
the review are planned. The research questions are formed that address the issues to be
FIGURE 1.5
Phases of systematic review: planning the review, conducting the review, and reporting (documenting) the results of the review.
FIGURE 1.6
Empirical study phases: study definition (scope, purpose, motivation, context); experiment design (research questions, hypothesis formulation, defining variables, data collection, selection of data analysis methods, validity threats); research conduct and analysis (descriptive statistics, attribute reduction, statistical analysis, model prediction and validation, hypothesis testing); results interpretation (theoretical and practical significance of results, limitations of the work); and reporting (presenting the results).
The scope of the empirical study defines the extent of the investigation. It involves listing
down the specific goals and objectives of the experiment. The purpose of the study may be
to find the effect of a set of variables on the outcome variable or to prove that technique A
is superior to technique B. It also involves identifying the underlying hypothesis that is
formulated at later stages. The motivation of the experiment describes the reason for
conducting the study. For example, the motivation of an empirical study may be to analyze
and assess the capability of a technique or method. The object of the study is the entity being
examined in the study. The entity in the study may be the process, product, or technique.
Perspective defines the view from which the study is conducted. For example, if the study
is conducted from the tester’s point of view then the tester will be interested in planning
and allocating resources to test faulty portions of the source code. Two important domains
in the study are programmers and programs (Basili et al. 1986).
     1. Research questions: The first step is to formulate the research problem. This step states
        the problem in the form of questions and identifies the main concepts and relations
        to be explored. For example, the following questions may be addressed in empirical
        studies to find the relationship between software metrics and quality attributes:
       a. What will be the effect of software metrics on quality attributes (such as fault
          proneness/testing effort/maintenance effort) of a class?
       b. Are machine-learning methods adaptable to object-oriented systems for pre-
          dicting quality attributes?
       c. What will be the effect of software metrics on fault proneness when severity of
          faults is taken into account?
     2. Independent and dependent variables: To analyze relationships, the next step is to
        define the dependent and the independent variables. The outcome variable pre-
        dicted by the independent variables is called the dependent variable. For instance,
        the dependent variables of the models chosen for analysis may be fault proneness,
        testing effort, and maintenance effort. A variable used to predict or estimate a
        dependent variable is called the independent (explanatory) variable.
     3. Hypothesis formulation: The researcher should carefully state the hypothesis to
        be tested in the study. The hypothesis is tested on the sample data. On the basis
        of the result from the sample, a decision concerning the validity of the hypothesis
         (acceptance or rejection) is made.
           Consider an example where a hypothesis is to be formed for comparing a num-
        ber of methods for predicting fault-prone classes.
           For each method, M, the hypothesis in a given study is the following (the
        relevant null hypothesis is given in parentheses), where the capital H indicates
        “hypothesis.” For example:
        H–M: M outperforms the compared methods for predicting fault-prone software classes
       (null hypothesis: M does not outperform the compared methods for predicting fault-
       prone software classes).
     4. Empirical data collection: The researcher decides the sources from which the
         data is to be collected. The literature shows that data is typically collected from
         university/academic systems, commercial systems, or open source software.
        The researcher should state the environment in which the study is performed,
      programming language in which the systems are developed, size of the systems
      to be analyzed (lines of code [LOC] and number of classes), and the duration for
      which the system is developed.
   5. Empirical methods: The data analysis techniques are selected based on the type
      of the dependent variables used. An appropriate data analysis technique should
      be selected by identifying its strengths and weaknesses. For example, a number of
      techniques are available for developing models to predict and analyze software
      quality attributes. These techniques may be statistical, like linear regression and
      logistic regression, or machine-learning techniques, like decision trees and support
      vector machines. Apart from these, there is a newer set of techniques, such as
      particle swarm optimization and gene expression programming, that are called
      search-based techniques. The details of these techniques can be found in Chapter 7.
      A minimal illustrative sketch of fitting one such technique appears after this list.
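As a hedged illustration of step 5, the sketch below fits one candidate statistical technique, logistic regression, to a small hypothetical table of object-oriented metrics (CBO, WMC, LOC) in order to predict fault proneness. The metric values, fault labels, and scikit-learn calls are assumptions made for demonstration; they are not data or tooling prescribed by the book.

```python
# Minimal sketch: one empirical method (logistic regression) applied to
# hypothetical metric data to predict the dependent variable (fault proneness).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "CBO": [4, 12, 3, 9, 15, 2, 8, 11, 5, 14],        # coupling between objects
    "WMC": [10, 35, 8, 22, 40, 6, 18, 30, 12, 38],    # weighted methods per class
    "LOC": [120, 560, 90, 300, 640, 75, 250, 480, 150, 600],
    "faulty": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],         # dependent variable
})

X = data[["CBO", "WMC", "LOC"]]  # independent (explanatory) variables
y = data["faulty"]               # dependent variable

# Hold out part of the data so the model can later be validated on unseen classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy on held-out data:", model.score(X_test, y_test))
```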
In the empirical study, the data is analyzed corresponding to the details given in the
experimental design. Thus, the experimental design phase must be carefully planned and
executed so that the analysis phase is clear and unambiguous. If the design phase does not
match the analysis part then it is most likely that the results produced are incorrect.
   1. Descriptive statistics: The data is validated for correctness before carrying out the
      analysis. The first step in the analysis is descriptive statistics. The research data
      must be suitably reduced so that the research data can be read easily and can be
      used for further analysis. Descriptive statistics concern development of certain
       indices or measures to summarize the data. The important statistical measures used
      for comparing different case studies include mean, median, and standard devia-
      tion. The data analysis methods are selected based on the type of the dependent
      variable being used. Statistical tests can be applied to accept or refute a hypothesis.
      Significance tests are performed for comparing the predicted performance of a
      method with other sets of methods. Moreover, effective data assessment should
      also yield outliers (Aggarwal et al. 2009).
   2. Attribute reduction: Feature subset selection is an important step that identifies
      and removes as much of the irrelevant and redundant information as possible.
      Reducing the dimensionality of the data reduces the size of the hypothesis space
      and allows the methods to operate faster and more effectively (Hall 2000).
   3. Statistical analysis: The data collected can be analyzed using statistical analysis by
      following the steps below.
        a. Model prediction: Multivariate analysis is used for model prediction; it finds
           the combined effect of the independent variables on the dependent variable.
           The performance of the predicted models is evaluated using performance
           measures, and the results are interpreted. Chapter 7 describes these
           performance measures.
       b. Model validation: In systems, where models are independently constructed from
          the training data (such as in data mining), the process of constructing the model is
          called training. The subsamples of data that are used to validate the initial analy-
          sis (by acting as “blind” data) are called validation data or test data. The valida-
          tion data is used for validating the model predicted in the previous step.
        c. Hypothesis testing: It determines whether the null hypothesis can be rejected
           at a specified significance level. The significance level is set by the researcher
           and is usually 0.01 or 0.05 (refer to Section 4.7 for details). A combined sketch
           of these analysis steps appears after this list.
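The following minimal sketch walks through the analysis steps above on synthetic data: descriptive statistics for each metric, attribute reduction with a univariate filter, model prediction validated by cross-validation, and a simple hypothesis test at an assumed significance level of 0.05. The data, feature names, library calls, and the simplified test against a 50% baseline are illustrative assumptions, not the procedures mandated in later chapters.

```python
# Minimal sketch of the analysis steps: descriptive statistics, attribute
# reduction, model prediction with validation, and hypothesis testing.
# All data here is synthetic and for illustration only.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=1)
n = 200
data = pd.DataFrame({
    "cbo": rng.poisson(8, n),          # coupling between objects (hypothetical)
    "wmc": rng.poisson(20, n),         # weighted methods per class (hypothetical)
    "loc": rng.integers(50, 1000, n),  # lines of code (hypothetical)
})
# Synthetic dependent variable: larger, more coupled classes tend to be faulty.
data["faulty"] = (
    data["cbo"] + data["loc"] / 100 + rng.normal(0, 3, n) > 14
).astype(int)

# 1. Descriptive statistics: summarize each independent variable.
print(data[["cbo", "wmc", "loc"]].agg(["mean", "median", "std"]))

# 2. Attribute reduction: keep the two most relevant attributes (univariate F-test).
X, y = data[["cbo", "wmc", "loc"]], data["faulty"]
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])
print("Selected attributes:", selected)

# 3a/3b. Model prediction and validation: 10-fold cross-validated accuracy.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X[selected], y, cv=10)
print("Mean cross-validated accuracy:", round(scores.mean(), 3))

# 3c. Hypothesis testing (simplified): is mean accuracy above a 50% guess at
# alpha = 0.05? A real study would use the tests described in Chapter 4 and
# account for the dependence between cross-validation folds.
t_stat, p_value = stats.ttest_1samp(scores, popmean=0.5)
alpha = 0.05
print("p-value:", round(p_value, 4),
      "-> reject H0" if p_value < alpha else "-> fail to reject H0")
```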
1.4.5 Reporting
Finally, after the empirical study has been conducted and interpreted, the study is reported
in the desired format. The results of the study can be disseminated in the form of a confer-
ence article, a journal paper, or a technical report.
  The results are to be reported from the reader’s perspective. Thus, the background,
motivation, analysis, design, results, and the discussion of the results must be clearly
documented. The audience may want to replicate or repeat the results of a study in a simi-
lar context. The experiment settings, data-collection methods, and design processes must
be reported in a sufficient level of detail. For example, the descriptive statistics, statistical
tools, and parameter settings of techniques must be provided. In addition, graphical
representations should be used to present the results, for example, using pie charts,
line graphs, box plots, and scatter plots.
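As a small, hedged illustration of this advice, the sketch below draws a box plot comparing the accuracy of two techniques across ten runs. The accuracy values, technique names, and output file name are invented purely for demonstration.

```python
# Minimal sketch: presenting results graphically with a box plot.
# The per-run accuracy values below are hypothetical.
import matplotlib.pyplot as plt

technique_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
technique_b = [0.74, 0.76, 0.72, 0.78, 0.75, 0.73, 0.77, 0.74, 0.76, 0.75]

plt.boxplot([technique_a, technique_b])
plt.xticks([1, 2], ["Technique A", "Technique B"])
plt.ylabel("Prediction accuracy per run")
plt.title("Comparing two techniques across ten runs")
plt.savefig("comparison_boxplot.png")  # hypothetical output file
```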
     4. Valid: The experiment conclusions should be valid for a wide range of populations.
    5. Unbiased: The researcher performing the study should not influence the results to sat-
       isfy the hypothesis. The research may produce some bias because of experiment error.
       The bias may be produced when the researcher selects the participants such that they
       generate the desired results. The measurement bias may occur during data collection.
    6. Control: The experiment design should be able to control the independent variables
       so that the confounding effects (interaction effects) of variables can be reduced.
    7. Replicable: Replication involves repeating the experiment with different data
       under same experimental conditions. If the replication is successful then this indi-
       cates generalizability and validity of the results.
    8. Repeatable: The experimenter should be able to reproduce the results of the study
       under similar settings.
TABLE 1.2
Examples of Unethical Research
S. No                                                Problem
1       Employees misleading the manager to protect themselves, with the knowledge of the researcher
2       Nonconformance to a mandatory process
3       Revealing identities of the participant or organization
4       Manager unexpectedly joining a group interview or discussion with the participant
5       Experiment revealing identity of the participants of a nonperforming department in an organization
6       Experiment outcomes are used in employee ratings
7       Participants providing information off the record, that is, after interview or discussion is over
  The ethical threats presented in Table 1.2  can be reduced by (1) presenting data and
results such that no information about the participant and the organization is revealed,
(2) presenting different reports to stakeholders, (3) providing findings to the participants
and giving them the right to withdraw any time during the research, and (4) providing
publication to companies for review before being published. Singer and Vinson (2001)
identified that the engineering and science ethics may not be related to empirical research
in software engineering. They provided the following four ethical principles:
     1. Research title: The title of the project must be included in the consent form.
     2. Contact details: The contact details (including ethics contact) will provide the
        participant information about whom to contact to clarify any questions or issues
        or complaints.
     3. Consent and comprehension: The participant gives consent in this section, stating
        that they have understood the requirements of the research.
     4. Withdrawal: This section states that the participants can withdraw from the
        research without any penalty.
     5. Confidentiality: It states the confidentiality related to the research study.
     6. Risks and benefits: This section states the risks and benefits of the research to the
        participants.
     7. Clarification: The participants can ask for any further clarification at any time
        during the research.
     8. Signature: Finally, the participant signs the consent form with the date.
1.5.3 Confidentiality
The information shared by the participants should be kept confidential. The researcher
should hide the identity of the organization and participant. Vinson and Singer (2008) iden-
tified three features of confidentiality—data privacy, participant anonymity, and data ano-
nymity. The data collected must be protected by password and only the people involved
in the research should have access to it. The data should not reveal the information about
the participant. The researchers should not collect personal information about participants.
For example, a participant identifier must be used instead of the participant's name.
Participant information is hidden from colleagues, professors, and the general public.
Hiding information from the manager is particularly essential as it
may affect the career of the participants. The information must be also hidden from the
organization’s competitors.
1.5.4 Beneficence
The participants must benefit from the research. Hence, methods that protect the interests
of the participants and do not harm them must be adopted. The research must not pose
a threat to the participants' jobs, for example, by creating an employee-ranking framework.
The revealing of an organization’s sensitive information may also bring loss to the company
in terms of reputation and clients. For example, if the names of companies are revealed in
the publication, the comparison between the processes followed in the companies or poten-
tial flaws in the processes followed may affect obtaining contracts from the clients. If the
research involves analyzing the process of the organization, the outcome of the research or
facts revealed from the research can harm the participants to a significant level.
participants must be to protect the interests of the participants so that they are protected
from any harm. Becker-Kornstaedt (2001) suggests that the participant interests can be
protected by using techniques such as manipulating data, providing different reports to
different stakeholders, and providing the right to withdraw to the participants.
  Finally, feedback of the research results must be provided to the participants. The opin-
ion of the participants about the validity of the results must also be asked. This will help
in increasing the trust between the researcher and the participant.
The predictive models constructed in ESE can be applied to future, similar industrial
applications. The empirical research enables software practitioners to use the results of the
experiment and ascertain that a set of good processes and procedures are followed dur-
ing software development. Thus, the empirical study can guide toward determining the
quality of the resultant software products and processes. For example, a new technique or
technology can be evaluated and assessed. The empirical study can help the software pro-
fessionals in effectively planning and allocating resources in the initial phases of software
development life cycle.
1.6.2 Academicians
While studying or conducting research, academicians are always curious to answer ques-
tions that are foremost in their minds. As the academicians dig deeper into their subject
or research, the questions tend to become more complex. Empirical research empowers
them with a great tool to find an answer by asking or interviewing different stakeholders,
1.6.3 Researchers
From the researchers' point of view, the results can be used to provide insight about exist-
ing trends and guidelines regarding future research. The empirical study can be repeated
or replicated by the researcher in order to establish generalizability of the results to new
subjects or data sets.
FIGURE 1.7
Elements of empirical research: purpose, participants, process, and product.
  Process lays down the way in which the research will be conducted. It defines the
sequence of steps taken to conduct a research. It provides details about the techniques,
methodologies, and procedures to be used in the research. The data-collection steps,
variables involved, techniques applied, and limitations of the study are defined in this
step. The process should be followed systematically to produce successful research.
  Participants are the subjects involved in the research. The participants may be inter-
viewed or closely observed to obtain the research results. While dealing with participants,
ethical issues in ESE must be considered so that the participants are not harmed in any
way.
  Product is the outcome produced by the research. The final outcome provides the
answer to research questions in the empirical research. The new technique developed or
methodology produced can also be considered as a product of the research. The journal
paper, conference article, technical report, thesis, and book chapters are products of the
research.
The typical evolution process is depicted in Figure  1.8. The figure shows that a change
is requested by a stakeholder (anyone who is involved in the project) in the project. The
second step requires analyzing the cost of implementing the change and the impact of
the change on the related modules or components. It is the responsibility of an expert
group known as the change control board (CCB) to determine whether the change must be
implemented or not. On the basis of the outcome of the analysis, the CCB approves or dis-
approves a change. If the change is approved, then the developers implement the change.
Finally, the change and the portions affected by the change are tested and a new version of
the software is released. The process of continuously changing the software may decrease
the quality of the software.
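One way to visualize the workflow described above is as a small state machine. The sketch below is an illustrative reading of Figure 1.8; the state names, transitions, and function are assumptions introduced for demonstration and are not part of the text.

```python
# Minimal sketch: the change-control workflow of Figure 1.8 as a state machine.
from enum import Enum, auto

class ChangeState(Enum):
    REQUESTED = auto()
    ANALYZED = auto()
    APPROVED = auto()
    DENIED = auto()
    IMPLEMENTED = auto()
    TESTED_AND_RELEASED = auto()

# Allowed transitions: request -> analyze -> CCB decision -> implement -> test/release.
TRANSITIONS = {
    ChangeState.REQUESTED: {ChangeState.ANALYZED},
    ChangeState.ANALYZED: {ChangeState.APPROVED, ChangeState.DENIED},
    ChangeState.APPROVED: {ChangeState.IMPLEMENTED},
    ChangeState.IMPLEMENTED: {ChangeState.TESTED_AND_RELEASED},
}

def advance(current: ChangeState, nxt: ChangeState) -> ChangeState:
    """Move a change request to the next state if the transition is allowed."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current.name} -> {nxt.name}")
    return nxt

# Example: a change request approved by the CCB and released in a new version.
state = ChangeState.REQUESTED
for nxt in (ChangeState.ANALYZED, ChangeState.APPROVED,
            ChangeState.IMPLEMENTED, ChangeState.TESTED_AND_RELEASED):
    state = advance(state, nxt)
    print("Change request is now:", state.name)
```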
   The main concerns during the evolution phase are maintaining the flexibility and qual-
ity of the software. Predicting defects, changes, efforts, and costs in the evolution phase
FIGURE 1.8
Software evolution cycle: request change, analyze change, approve or deny (CCB decision), implement change, and test change.

FIGURE 1.9
Prediction during evolution phase: defect prediction (What are the defect-prone portions in the maintenance phase?) and change prediction.
     1.   Functionality
     2.   Usability
     3.   Testability
     4.   Reliability
     5.   Maintainability
     6. Adaptability
The attribute domains can be further divided into attributes that are related to software
quality and are given in Figure 1.10. The details of software quality attributes are given in
Table 1.3.
FIGURE 1.10
Software quality attributes: (1) functionality (completeness, correctness, security, traceability, efficiency); (2) usability (learnability, operability, user-friendliness, installability, satisfaction); (3) testability (verifiability, validatability); (4) reliability (robustness, recoverability); (5) maintainability (agility, modifiability, readability, flexibility); and (6) adaptability (portability, interoperability).
TABLE 1.3
Software Quality Attributes
Functionality: The degree to which the purpose of the software is satisfied
  1. Completeness: The degree to which the software is complete
  2. Correctness: The degree to which the software is correct
  3. Security: The degree to which the software is able to prevent unauthorized access to the program data
  4. Traceability: The degree to which a requirement is traceable to the software design and source code
  5. Efficiency: The degree to which the software requires resources to perform a software function
Testability: The ease with which the software can be tested to demonstrate the faults
  1. Verifiability: The degree to which the software deliverable meets the specified standards, procedures, and processes
  2. Validatable: The ease with which the software can be executed to demonstrate whether the established testing criteria are met
Maintainability: The ease with which faults can be located and fixed, the quality of the software can be improved, or the software can be modified in the maintenance phase
  1. Agility: The degree to which the software is quick to change or modify
  2. Modifiability: The degree to which the software is easy to implement, modify, and test in the maintenance phase
  3. Readability: The degree to which the software documents and programs are easy to understand so that faults can be easily located and fixed in the maintenance phase
  4. Flexibility: The ease with which changes can be made in the software in the maintenance phase
Adaptability: The degree to which the software is adaptable to different technologies and platforms
  1. Portability: The ease with which the software can be transferred from one platform to another
  2. Interoperability: The degree to which the system is compatible with other systems
For example, a measure is the number of failures experienced during testing. Measurement
is the way of recording such failures. A software metric may be the average number of
failures experienced per hour during testing.
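For example, a short Python sketch (with hypothetical failure counts and testing hours) shows how recorded measures yield such a derived metric:

    # Minimal sketch: from recorded measures to a derived metric.
    # The failure counts and the testing duration below are hypothetical.
    failures_per_session = [3, 1, 0, 2, 4]   # measures: failures observed in each test session
    total_testing_hours = 10.0               # measurement of the testing effort

    total_failures = sum(failures_per_session)
    failures_per_hour = total_failures / total_testing_hours   # metric: average failures per hour
    print(f"Average failures per hour during testing: {failures_per_hour:.2f}")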
   Fenton and Pfleeger (1996) have defined measurement as:
      It is the process by which numbers or symbols are assigned to attributes of entities in
      the real world in such a way as to describe them according to clearly defined rules.
FIGURE 1.11
Example of classification process: a module with coupling < 8 is classified as low, and one with coupling > 8 is classified as high.
FIGURE 1.12
Steps in the classification process: the learned model is evaluated on validation data and then used to predict outcomes for new data.
analysis can be categorized by identifying patterns in the textual information. This can
be achieved by reading and analyzing the texts and deriving logical categories, which helps
organize the data into categories. For example, answers to open-ended questions can be
presented in the form of categories.
Text mining is another way to process qualitative data into a useful form that can be used
for further analysis.
FIGURE 1.13
Independent and dependent variables: the causes (independent variables) feed into the experiment process, which produces the effect (dependent variable).
are input variables that are manipulated or controlled by the researcher to measure the
response of the dependent variable.
  The dependent variable (or response variable) is the output produced by analyzing the
effect of the independent variables. The dependent variables are presumed to be influenced
by the independent variables. The independent variables are the causes and the depen-
dent variable is the effect. Usually, there is only one dependent variable in the research.
Figure 1.13 depicts that the independent variables are used to predict the outcome variable
following a systematic experimental process.
  Examples of independent variables are lines of source code, number of methods, and
number of attributes. Dependent variables are usually measures of software quality
attributes. Examples of dependent variables are effort, cost, faults, and productivity. Consider
the following research question:
  Do software metrics have an effect on the change proneness of a module?
  Here, software metrics are the independent variables and change proneness is the
dependent variable.
  Apart from the independent variables, unknown variables or confounding variables
(extraneous variables) may affect the outcome (dependent) variable. Randomization can
nullify the effect of confounding variables. In randomization, many replications of the
experiment are executed and the results are averaged over multiple runs, which may
cancel out the effect of extraneous variables in the long run.
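For illustration, a small Python sketch (the run_experiment function and its outcome values are hypothetical) shows how results can be averaged over many randomized replications so that the influence of extraneous variables tends to cancel out:

    import random
    import statistics

    def run_experiment(seed):
        # Hypothetical single run; the seed controls the random assignment,
        # so confounding factors vary from replication to replication.
        rng = random.Random(seed)
        # Placeholder outcome: replace with the real dependent-variable measurement.
        return 0.75 + rng.uniform(-0.05, 0.05)

    # Execute many replications with different random assignments and average the results.
    results = [run_experiment(seed) for seed in range(30)]
    print("Mean outcome over 30 replications:", round(statistics.mean(results), 3))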
Open source software is usually freely available software developed collaboratively by many
developers from different places. Examples include Google Chrome, the Android operating
system, and the Linux operating system.
FIGURE 1.14
(a) Within-company versus (b) cross-company prediction.
     1. Goal
     2. Question
     3. Metric
In the GQM method, measurement is goal oriented. Thus, the goals that can be measured
during software development need to be defined first. The GQM method defines goals
that are transformed into questions and metrics. These questions are answered later to
determine whether the goals have been satisfied or not. Hence, the GQM method follows a
top-down approach for dividing goals into questions and mapping questions to metrics,
and a bottom-up approach for interpreting the measurements to verify whether the
goals have been satisfied. Figure 1.15 presents the hierarchical view of the GQM framework.
The figure shows that the same metric can be used to answer multiple questions.
  For example, suppose the developer wants to improve the defect-correction rate during the
maintenance phase; the goal, questions, and associated metrics are then defined accordingly.
Goals are defined in terms of purposes, objects, and viewpoints (Basili et al. 1994). In the above
example, the purpose is “to improve,” the object is “defects,” and the viewpoint is “project manager.”
FIGURE 1.15
Framework of GQM: a goal is refined into questions, which are answered by metrics; the same metric may answer multiple questions.
FIGURE 1.16
Phases of GQM: planning, definition, data collection (collecting data), and interpretation (answering questions, measurement, goal evaluated).
  Figure 1.16 presents the phases of the GQM method. The GQM method has the following
four phases:
  • Planning: In the first phase, the project plan is produced by recognizing the basic
    requirements.
  • Definition: In this phase goals, questions, and relevant metrics are defined.
  • Data collection: In this phase actual measurement data is collected.
  • Interpretation: In the final phase, the answers to the questions are provided and
    the goal’s attainment is verified.
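For illustration, the top-down decomposition of a goal into questions and metrics, and the bottom-up interpretation of the collected metrics, can be captured in a simple data structure; the questions and metric names below are illustrative rather than taken from the text:

    # Illustrative GQM decomposition (goal -> questions -> metrics); names are hypothetical.
    gqm = {
        "goal": "Improve the defect-correction rate during the maintenance phase",
        "questions": {
            "Q1: What is the current defect-correction rate?": [
                "defects corrected per week",
                "open defect count",
            ],
            "Q2: How long does a typical correction take?": [
                "mean time to repair",
                "defects corrected per week",   # the same metric may answer multiple questions
            ],
        },
    }

    # Bottom-up interpretation: collected metric values are mapped back to the questions
    # and used to judge whether the goal has been attained.
    for question, metrics in gqm["questions"].items():
        print(question, "->", ", ".join(metrics))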
Exercises
     1.1 What is empirical software engineering? What is the purpose of empirical soft-
        ware engineering?
     1.2 What is the importance of empirical studies in software engineering?
     1.3 Describe the characteristics of empirical studies.
     1.4 What are the five types of empirical studies?
     1.5 What is the importance of replicated and repeated studies in empirical software
        engineering?
     1.6 Explain the difference between an experiment and a case study.
     1.7 Differentiate between quantitative and qualitative research.
     1.8 What are the steps involved in an experiment? What are characteristics of a good
        experiment?
  1.9 What are the ethics involved in research? Give examples of unethical research.
  1.10 Discuss the following terms:
     a. Hypothesis testing
     b. Ethics
     c. Empirical research
     d. Software quality
  1.11 What are systematic reviews? Explain the steps in systematic review.
  1.12 What are the key issues involved in empirical research?
  1.13 Compare and contrast classification and prediction process.
  1.14 What is GQM method? Explain the phases of GQM method.
  1.15 List the importance of empirical research from the perspective of software indus-
     tries, academicians, and researchers.
  1.16 Differentiate between the following:
     a. Parametric and nonparametric tests
     b. Independent, dependent and confounding variables
     c. Quantitative and qualitative data
     d. Within-company and cross-company analysis
     e. Proprietary and open source software
Further Readings
Kitchenham et al. effectively provide guidelines for empirical research in software
engineering:
Juristo and Moreno explain a good number of concepts of empirical software engineering:
  N. Mays, and C. Pope, “Qualitative research: Rigour and qualitative research,” British
     Medical Journal, vol. 311, no. 6997, pp. 109–112, 1995.
  A. Strauss, and J. Corbin, Basics of Qualitative Research: Techniques and Procedures for
     Developing Grounded Theory, Sage Publications, Thousand Oaks, CA, 1998.
Details about ethical issues in empirical software engineering are presented in:
Detailed practical guidelines on the preparation, conduct, design, and reporting of case
studies in software engineering are presented in:
The following research paper provides detailed explanations about software quality
attributes:
     A. J. Albrecht, and J. E. Gaffney, “Software function, source lines of code, and devel-
        opment effort prediction: A software science validation,” IEEE Transactions on
        Software Engineering, vol. 6, pp. 639–648, 1983.
The following research papers provide a brief overview of quantitative and qualitative
data in software engineering:
     A. Bryman, and B. Burgess, Analyzing Qualitative Data, Routledge, New York, 2002.
Basili explains the major role of controlled experiments in the software engineering field in:
The concepts of proprietary, open source, and university software are well explained in the
following research paper:
The book by Solingen and Berghout is a classic and very useful reference that gives a
detailed discussion of the GQM method:
  R. Prieto-Díaz, “Status report: Software reusability,” IEEE Software, vol. 10, pp. 61–66,
     1993.
2
Systematic Literature Reviews
Review of existing literature is an essential step before beginning any new research.
Systematic reviews (SRs) synthesize the existing research work in such a manner that can be
analyzed, assessed, and interpreted to draw meaningful conclusions. The aim of conducting
an SR is to gather and interpret empirical evidence from the available research with respect
to formed research questions. The benefit of conducting an SR is to summarize the existing
trends in the available research, identify gaps in the current research, and provide future
guidelines for conducting new research. The SRs also provide empirical evidence in
support of or in opposition to a given hypothesis. Hence, the author of an SR must make
every effort to provide evidence that supports or does not support a given research hypothesis.
  In this chapter, guidelines for conducting SRs are given for software engineering researchers
and practitioners. The steps to be followed while conducting an SR, including the planning,
conducting, and reporting phases, are described. Existing high-quality reviews in the
area of software engineering are also presented in this chapter.
TABLE 2.1
Comparison of Systematic Reviews and Literature Surveys
1. Systematic review: The goal is to identify best practices, strengths, and weaknesses of specific techniques, procedures, tools, or methods by combining information from various studies.
   Literature survey: The goal is to classify or categorize the existing literature.
2. Systematic review: Focused on research questions that assess the techniques under investigation.
   Literature survey: Provides an introduction to each paper in the literature based on the identified area.
3. Systematic review: Provides a detailed review of the existing literature.
   Literature survey: Provides a brief overview of the existing literature.
4. Systematic review: Extracts technical and useful metadata from the contents.
   Literature survey: Extracts general research trends from the studies.
5. Systematic review: The search process is more stringent; it involves searching references or contacting researchers in the field.
   Literature survey: The search process is less stringent.
6. Systematic review: Strong assessment of quality is necessary.
   Literature survey: Strong assessment of quality is not necessary.
7. Systematic review: Results are based on high-quality evidence with the aim of answering the research questions.
   Literature survey: Results only provide a summary of the existing literature.
8. Systematic review: Often uses statistics to analyze the results.
   Literature survey: Does not use statistics to analyze the results.
SRs summarize high-quality research on a specific area. They provide the best available
evidence on a particular technique or technology and produce conclusions that can be
used by the software practitioners and researchers to select the best available techniques
or methodologies. The studies included in the review are known as primary studies, and
the SRs themselves are known as secondary studies. Table 2.1 presents a summary of the
differences between an SR and a literature survey.
     1. It selects high-quality research papers and studies that are relevant, important,
        and essential, which are summarized in the form of one review paper.
     2. It performs a systematic search by forming a search strategy to identify primary
        studies from the digital libraries. The search strategy is documented so that the
        readers can analyze the completeness of the process and repeat the same.
     3. It forms a valid review protocol and research questions that address the issues to
        be answered in the SR.
     4. It clearly summarizes the characteristics of each selected study, including aims,
        techniques, and methods used in the studies.
     5. It applies justified quality assessment criteria for inclusion and exclusion of
        the studies in the SR so that the effectiveness of each study can be determined.
     6. It uses a number of presentation tools for reporting the findings and results of the
        selected studies to be included in the SR.
     7. It identifies gaps in the current findings and highlights future directions.
The procedure followed in performing the SR is given by Kitchenham et al. (2007). The
process is depicted in Figure 2.1. In the first step, the need for the SR is examined and in the
second step the research questions are formed that address the issues to be answered in
the review. Thereafter, the review protocol is developed that includes the following steps:
search strategy design, study selection criteria, study quality assessment criteria, data
extraction process, and data synthesis process.
   The formation of review protocol consists of a series of stages. In the first step, the
search strategy is formed, including identification of search terms and selection of
sources to be searched to identify the primary studies. The next step involves deter-
mination of relevant studies by setting the inclusion and exclusion criteria for select-
ing review studies. Thereafter, quality assessment criteria are identified by forming the
quality assessment questionnaire to analyze and assess the studies. The second to last
stage involves the design of data extraction forms to collect the required information
to answer the research questions, and in the final stage, methods for data synthesis are
devised. Development of review protocol is an important step in an SR as it reduces the
possibility and risk of research bias in the SR. Finally, in the planning stage, the review
protocol is evaluated.
   The steps planned in the first stage are actually performed in the conducting stage that
includes actual collection of relevant studies by applying first the search strategy and then
the inclusion and exclusion criteria. Each selected study is ranked according to the qual-
ity assessment criteria, and the data extraction and data synthesis steps are followed from
only the selected high-quality primary studies. In the final phase, the results of the SR are
reported. This step further involves examining, presenting, and verifying the results.
FIGURE 2.1
Systematic review process: planning the review (including identification of research questions), conducting the review (including study quality assessment, data extraction, and data synthesis), and reporting the review.
The above stages defined in the SR are iterative and not sequential. For example, the criteria
for inclusion and exclusion of primary studies must be developed prior to collecting the
studies. The criteria may be refined in the later stages.
   1. How many primary studies are available in the software engineering context?
   2. What are the strengths and weaknesses of the existing SR (if any) in the software
      engineering context?
   3. What is the practical relevance of the proposed SR?
   4. How will the proposed SR guide practitioners and researchers?
   5. How can the quality of the proposed SR be evaluated?
A checklist is the most common mechanism used for reviewing the quality of an existing SR
in the same area. It may also identify flaws in the existing SR. A checklist may consist
of a list of questions to determine the effectiveness of the existing SR. Table 2.2 shows an
example of a checklist to assess the quality of an SR. The checklist consists of questions
pertaining to the procedures and processes followed during an SR. The existing studies
may be rated on a scale of 1–12 so that the quality of each study can be determined.
                TABLE 2.2
                Checklist for Evaluating Existing SR
                S. No.                              Questions
We may establish a threshold value to identify the quality level of a study. If the rating of
the existing SR falls below the established threshold value, the quality of the study may be
considered unacceptable and a new SR on the same topic may be conducted.
  Thus, if an SR in the same domain with similar aims is located but it was conducted a
long time ago, then a new SR adding current studies may be justified. However, if the exist-
ing SR is still relevant and is of high quality, then a new SR may not be required.
     • Which areas have already been explored in the existing reviews (if any)?
     • Which areas are relevant and need to be explored/answered during the
       proposed SR?
     • Are the questions important to the researchers and software practitioners?
     • Will the questions assess any similarities in the trends or identify any deviation
       from the existing trends?
TABLE 2.3
Research Questions for SRML Case Study (Malhotra 2015)
RQ1: Which ML techniques have been used for SFP?
     Motivation: Identify the ML techniques commonly being used in SFP.
RQ2: What kind of empirical validation for predicting faults is found using the ML techniques found in RQ1?
     Motivation: Assess the empirical evidence obtained.
RQ2.1: Which techniques are used for subselecting metrics for SFP?
     Motivation: Identify techniques reported to be appropriate for selecting relevant metrics.
RQ2.2: Which metrics are found useful for SFP?
     Motivation: Identify metrics reported to be appropriate for SFP.
RQ2.3: Which metrics are found not useful for SFP?
     Motivation: Identify metrics reported to be inappropriate for SFP.
RQ2.4: Which data sets are used for SFP?
     Motivation: Identify data sets reported to be appropriate for SFP.
RQ2.5: Which performance measures are used for SFP?
     Motivation: Identify the measures that can be used for assessing the performance of the ML techniques for SFP.
RQ3: What is the overall performance of the ML techniques for SFP?
     Motivation: Investigate the performance of the ML techniques for SFP.
RQ4: Is the performance of the ML techniques better than that of statistical techniques?
     Motivation: Compare the performance of the ML techniques with statistical techniques for SFP.
RQ5: Are there any ML techniques that significantly outperform other ML techniques?
     Motivation: Assess the performance of the ML techniques against other ML techniques for SFP.
RQ6: What are the strengths and weaknesses of the ML techniques?
     Motivation: Determine the conditions that favor the use of ML techniques.
The research questions in Table 2.3 address various issues related to the SR on the use of
the ML techniques for SFP, along with the motivation for each question in the SRML case
study. While forming the research questions, the interest of the researchers must be kept in
mind. For example, for Master's and PhD theses, it is necessary to identify the research
relevant to the proposed work so that the current body of knowledge can be formed and
the proposed work can be established.
FIGURE 2.2
Steps involved in a review protocol: development of the search strategy (search terms and digital libraries), construction of quality assessment checklists, development of data extraction forms, and identification of study synthesis techniques.
In this step, the planning of the search strategy, study selection criteria, quality assessment
criteria, data extraction, and data synthesis is carried out.
  The purpose of the review must state the options researchers have when deciding which
technique or method to adopt in practice. The review protocol is established by frequently
holding meetings and group discussions in a group comprising, preferably, senior members
with experience in the area. Hence, this step is iterative, and the protocol is defined
and refined over various iterations. Figure 2.2 shows the steps involved in the development
of the review protocol.
  The first step involves formation of search terms, selection of digital libraries that must
be searched, and refinement of search terms. This step allows identification of primary
studies that will address the research questions. The initial search terms are identified
and then refined to form the best-suited search string: sophisticated search terms are formed
by incorporating alternative terms and synonyms using the Boolean operator “OR” and by
combining the main search terms using “AND.” The following general search terms were
used for identification of primary studies in the SRML case study:
   Software AND (fault OR defect OR error) AND (proneness OR prone OR prediction OR
probability) AND (regression OR ML OR soft computing OR data mining OR classifica-
tion OR Bayesian network OR neural network [NN] OR decision tree OR support vector
machine OR genetic algorithms OR random forest [RF]).
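For illustration, the combination of synonyms with OR and of the main concepts with AND can be automated; the Python sketch below uses a shortened, illustrative subset of the SRML terms:

    # Build a Boolean search string: synonyms joined by OR, main concepts joined by AND.
    # The term groups are an illustrative subset of the SRML search terms.
    term_groups = [
        ["software"],
        ["fault", "defect", "error"],
        ["proneness", "prone", "prediction", "probability"],
        ["regression", "machine learning", "data mining", "neural network", "random forest"],
    ]

    def build_search_string(groups):
        clauses = []
        for group in groups:
            if len(group) == 1:
                clauses.append(group[0])
            else:
                clauses.append("(" + " OR ".join(group) + ")")
        return " AND ".join(clauses)

    print(build_search_string(term_groups))
    # software AND (fault OR defect OR error) AND (proneness OR ...) AND (regression OR ...)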
   After identifying the search terms, the relevant and important digital portals are to be
selected. The portals publishing journal articles are the right place to search for relevant
studies. Bibliographic databases are also a common place to search, as they provide the
title, abstract, and publication source of each study. The selection of digital libraries/portals
is essential, as the number of studies found depends on it. Generally, several libraries must
be searched to find all the relevant studies that cover the research questions.
The selection must not be restricted by the availability of digital portals at the home uni-
versities. For example, the following seven electronic digital libraries may be searched for
the identification of primary studies:
   1.   IEEE Xplore
   2.   ScienceDirect
   3.   ACM Digital Library
   4.   Wiley Online Library
   5.   Google Scholar
   6.   SpringerLink
   7.   Web of Science
The reference section of the relevant studies must also be examined/scanned to identify the
other relevant studies. The external experts in the areas may also be contacted in this regard.
  The next step is to establish the inclusion and exclusion criteria for the SR. The inclusion
and exclusion criteria allow the researchers to decide whether to include or exclude the
study in the SR. The inclusion and exclusion criteria are based on the research questions.
For example, the studies that use data collected from university software developed by
student programmers or experiments conducted by students may be excluded from the
SR. Similarly, the studies that do not perform any empirical analysis on the techniques
and technologies that are being examined in the SR may be excluded. Hence, the inclusion
criteria may be specific to the type of tool, technique, or technology being explored in the
SR. The data on which the study was conducted or the type of empirical data being used
(academia or industry/small, medium, or large sized) may also affect the inclusion criteria.
  The following inclusion and exclusion criteria were formed in SRML review:
  Inclusion criteria:
     • Empirical studies using the ML techniques for SFP.
     • Empirical studies combining the ML and non-ML techniques.
     • Empirical studies comparing the ML and statistical techniques.
  Exclusion criteria:
     • Studies without empirical analysis or results of use of the ML techniques for SFP.
     • Studies based on fault count as dependent variable.
     • Studies using the ML techniques in context other than SFP.
      • Similar studies, that is, studies by the same author published in a conference as
        well as in an extended journal version. However, if the results differed between the
        two versions, both studies were retained.
     • Studies that only use statistical techniques for SFP.
     • Review studies.
The above inclusion and exclusion criteria were applied to each relevant study by two
researchers independently, who reached a common decision after detailed discussion.
In case of any doubt, the full text of a study was reviewed and a final decision regarding
the inclusion/exclusion of the study was made. Hence, more than one reviewer should
check the relevance of a study against the inclusion and exclusion criteria before a final
decision for inclusion or exclusion of a study is made.
  The third step in development of a review protocol is to form the quality questionnaire
for assessing the relevance and strength of the primary studies. The quality assessment is
necessary to investigate and analyze the quality and determine the strength of the stud-
ies to be included in final synthesis. It is necessary to limit the bias in the SR and provide
guidelines for interpretation of the results.
  The assessment criteria must be based on the relevance of a particular study to the
research questions and the quality of the processes and methods used in the study.
In  addition, quality assessment questions must focus on experimental design, appli-
cability of results, and interpretation of results. Some studies may meet the inclusion
criteria but may not be relevant with respect to the research design, the way in which
data is collected, or may not justify the use of various techniques analyzed. For example,
a study on fault proneness may not perform comparative analysis of ML and non-ML
techniques.
  The quality questionnaire must be constructed by weighting the studies with numerical
values. Table 2.4 presents the quality assessment questions for any SR. The studies are rated
according to each question and given a score of 1 (yes) if it is satisfactory, 0.5 (partly) if it is
moderately satisfactory, and a score of 0 (no) if it is unsatisfactory. The final score is obtained
after adding the values assigned to each question. A study could have a maximum score of
10 and a minimum score of 0, if ranked on the basis of quality assessment questions formed
in Table 2.4. The studies with low-quality scores may be excluded from the SR or final list of
primary studies.
  In addition to the questions given in Table 2.4, the following four additional questions
were formed in SRML review (see Table 2.5). Hence, a researcher may create specific qual-
ity assessment questions with respect to the SR.
  The quality score along with the level assigned to the study in the example case study
SRML taken in this chapter is given in Table 2.6. The reviewers must decide a threshold
value for excluding a study from the SR. For example, studies with quality score >9 were
considered for further data extraction and synthesis in SRML review.
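For illustration, the scoring and threshold-based filtering can be expressed in a few lines of Python; the per-question answers and the threshold below are hypothetical:

    # Score each study against the quality questions (yes = 1, partly = 0.5, no = 0)
    # and keep only those whose total exceeds a chosen threshold.
    SCORES = {"yes": 1.0, "partly": 0.5, "no": 0.0}

    # Hypothetical answers of two candidate studies to ten quality questions.
    study_answers = {
        "S1": ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "partly"],
        "S2": ["no", "partly", "no", "yes", "no", "partly", "no", "no", "yes", "no"],
    }

    threshold = 9.0   # e.g., SRML retained studies with a quality score above 9
    for study, answers in study_answers.items():
        total = sum(SCORES[a] for a in answers)
        decision = "include" if total > threshold else "exclude"
        print(f"{study}: quality score = {total}, {decision}")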
            TABLE 2.4
            Quality Assessment Questions
            Q#                     Quality Questions                    Yes   Partly   No
           TABLE 2.5
           Additional Quality Assessment Questions for SRML Review
           Q#                   Quality Questions                   Yes   Partly   No
                               TABLE 2.6
                               Quality Score Levels for the Quality Assessment
                               Questions Given in Table 2.4
                               Quality Score                Level
                               9 ≤ score ≤ 10               Very high
                               6 ≤ score ≤ 8                High
                               4 ≤ score ≤ 5                Medium
                               0 ≤ score ≤ 3                Low
  The next step is to construct data extraction forms that will help to summarize the infor-
mation extracted from the primary studies in view of the research questions. The details of
which specific research questions are answered by a given primary study are also present
in the data extraction form. Hence, one of the aims of data extraction is to determine which
research questions are addressed by each primary study. In many cases, the
data extraction forms will extract the numeric data from the primary studies that will
help to analyze the results obtained from these primary studies. The first part of the data
extraction card summarizes the author name, title of the primary study, and publishing
details, and the second part of the data extraction form contains answers to the research
questions extracted from a given primary study. For example, the data set details, indepen-
dent variables (metrics), and the ML techniques are summarized for the SRML case study
(see Figure 2.3).
  A team of researchers must collect the information from the primary studies. However,
because of time and resource constraints, at least two researchers should evaluate the
primary studies to obtain the information to be included in the data extraction card.
The results from these two researchers must then be matched, and if there is any
disagreement between them, other researchers may be consulted to resolve the
disagreements. The researchers must clearly understand the research questions and the review
protocol before collecting the information from the primary studies. In case of Masters
and PhD students, their supervisors may collect information from the primary studies and
then match their results with those obtained by the students.
  The last step involves identification of data synthesis tools and techniques to summarize
and interpret the information obtained from the primary studies. The basic objective while
synthesizing data is to accumulate and combine facts and figures obtained from the selected
primary studies to formulate a response to the research questions. Tables and charts may be
used to highlight the similarities and differences between the primary studies. The following
                                                Section I
                          Reviewer name
                          Author name
                          Title of publication
                          Year of publication
                          Journal/conference name
                          Type of study
                                               Section II
                          Data set used
                          Independent variables
                          Feature subselection methods
                          ML techniques used
                          Performance measures used
                          Values of accuracy measures
                          Strengths of ML techniques
                          Weaknesses of ML techniques
FIGURE 2.3
Data extraction form.
steps need to be followed before deciding the tools and methods to be used for depicting the
results of the research questions:
The effects of the results (performance measures) obtained from the primary studies may
be analyzed using statistical measures such as mean, median, and standard deviation (SD).
  In addition, the outliers present in the results may be identified and removed using
various methods such as box plots. We must also use various tools such as bar charts,
scatter plots, forest plots, funnel plots, and line charts to visually present the results of
the primary studies in the SR. The aggregation of the results from various studies will
allow researchers to draw strong and widely accepted conclusions and may give strong
support in proving a point. The data obtained from these studies may be quantitative
(expressed in the form of numerical measures) or qualitative (expressed in the form of
descriptive information/texts). For example, the values of performance measures are
quantitative in nature, and the strengths and weaknesses of the ML techniques are quali-
tative in nature.
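For illustration, the following Python sketch computes such descriptive statistics and applies a simple box-plot rule (1.5 times the interquartile range) to flag outliers; the AUC values are hypothetical:

    import statistics

    # Hypothetical AUC values reported by a set of primary studies.
    auc_values = [0.71, 0.74, 0.69, 0.72, 0.95, 0.70, 0.73, 0.68]

    mean = statistics.mean(auc_values)
    median = statistics.median(auc_values)
    sd = statistics.stdev(auc_values)

    # Box-plot style outlier rule: values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(auc_values, n=4)
    iqr = q3 - q1
    outliers = [v for v in auc_values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

    print(f"mean={mean:.3f}, median={median:.3f}, SD={sd:.3f}, outliers={outliers}")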
  A detailed description of the methods and techniques identified to represent answers to
the established research questions in the SRML case study for SFP using the ML techniques
is given as follows:
     • To summarize the number of ML techniques used in the primary studies, the SRML
       case study will use a visualization technique, that is, a line graph to depict the number
       of studies pertaining to the ML techniques in each year, and will present a classification
       taxonomy of various ML techniques with their major categories and subcategories.
       The case study will also present a bar chart that shows the total number of studies
       conducted for each main category of ML technique (a minimal plotting sketch follows
       this list) and pie charts that depict the distribution of selected studies into
       subcategories for each ML category.
     • The case study will use a counting method to find the feature subselection techniques,
       useful and not useful metrics, and commonly used data sets for SFP. These subparts
       will be further aided by graphs and pie charts that show the distribution of selected
       primary studies by metrics usage and data set usage. Performance measures will be
       summarized with the help of a table and a graph.
     • The comparison of the results of the primary studies will be shown with the help of a
       table that compares six performance measures for each ML technique. Box plots will be
       constructed to identify extreme values corresponding to each performance measure.
     • A bar chart will be created to depict and analyze the comparison between the
       performance of the statistical and ML techniques.
     • The strengths and weaknesses of the different ML techniques for SFP will be
       summarized in tabular format.
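For illustration, a minimal matplotlib sketch produces the kind of bar chart described above; the category counts below are illustrative, and the output file name is arbitrary:

    import matplotlib.pyplot as plt

    # Illustrative counts of primary studies per main ML category.
    categories = ["Decision tree", "Neural network", "Support vector machine", "Bayesian learning"]
    study_counts = [31, 17, 18, 12]

    plt.bar(categories, study_counts)
    plt.ylabel("Number of primary studies")
    plt.title("Primary studies per main ML category")
    plt.xticks(rotation=20)
    plt.tight_layout()
    plt.savefig("ml_category_counts.png")   # or plt.show() for interactive use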
Finally, the review protocol document may consist of the following sections:
  1.   Development of appropriate search strings that are derived from research questions
  2.   Adequacy of inclusion and exclusion criteria
  3.   Completeness of quality assessment questionnaire
  4.   Design of data extraction forms that address various research questions
  5.   Appropriateness of data analysis procedures
Masters and PhD students must present the review protocol to their supervisors for the
comments and analysis.
                           TABLE 2.7
                           Contingency Table for Binary Variable
                                             Outcome Present      Outcome Absent
                           Group 1           a11                  a12
                           Group 2           a21                  a22

                                   RR = Risk 1 / Risk 2

          ii. Odds ratio (OR): It measures the strength of the presence or absence of an
              event. It is the ratio of the odds of an outcome in the two groups. It is desired
              that the value is greater than one. The OR is defined as:

                                   Odds 1 = a11 / a12,    Odds 2 = a21 / a22

                                   OR = Odds 1 / Odds 2
                          TABLE 2.8
                          Example Contingency Table for Binary
                          Variable
                                           Faulty      Not Faulty    Total
                          Coupled            31                4      35
                          Not coupled         4               99     103
                          Total              35              103     138
     b. For continuous variables (variables that do not have any specified range), the
        following commonly used effects are of interest:
          i. Mean difference: This measure is used when a study reports the same type
             of outcome and measures it on the same scale. It is also known as the
             “difference of means.” It represents the difference between the mean values
             of the two groups (Kitchenham 2007). Let X̄g1 and X̄g2 be the means of the two
             groups (say g1 and g2). The mean difference is defined as:

                 Mean difference = X̄g1 − X̄g2
          ii. Standardized mean difference: It is used when a study reports the same
              type of outcome measure but measures it in different ways. For example,
              the size of a program may be measured in function points or lines of code.
              The standardized mean difference is defined as the ratio of the difference
              between the means of the two groups to the SD of the pooled outcome. Let
              SDpooled be the SD pooled across groups, SDg1 and SDg2 be the SDs of the
              two groups, and ng1 and ng2 be the sizes of the two groups. The formula
              for the standardized mean difference is given below:

                  Standardized mean difference = (X̄g1 − X̄g2) / SDpooled

              where

                  SDpooled = sqrt( [ (ng1 − 1) × SDg1² + (ng2 − 1) × SDg2² ] / (ng1 + ng2 − 2) )

              For example, for two groups of 20 observations each, with means 110 and 100
              and SDs 5 and 4:

                  SDpooled = sqrt( [ (20 − 1) × 5² + (20 − 1) × 4² ] / (20 + 20 − 2) ) = 4.527

                  Standardized mean difference = (110 − 100) / 4.527 = 2.209
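For illustration, the computation can be packaged as a small Python function; it reproduces the numbers of the worked example above:

    import math

    def standardized_mean_difference(mean1, mean2, sd1, sd2, n1, n2):
        # Pooled SD across the two groups, then the standardized difference of means.
        sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
        return (mean1 - mean2) / sd_pooled

    # Two groups of 20 observations with means 110 and 100 and SDs 5 and 4.
    print(round(standardized_mean_difference(110, 100, 5, 4, 20, 20), 3))   # 2.209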
     Example 2.1
     Consider the following data (refer Table 2.9) consisting of an attribute data class that can
     have binary values true or false, where true represents that the class is data intensive
     (number of declared variables is high) and false represents that the class is not data
     intensive (number of declared variables is low). The outcome variable is change that
     contains “yes” and “no,” where “yes” represents presence of change and “no” repre-
     sents absence of change.
       Calculate RR, OR, and risk difference.
     Solution
      The 2 × 2 contingency table is given in Table 2.10.
                          Risk 1 = 6 / (6 + 2) = 0.75,    Risk 2 = 1 / (1 + 6) = 0.142
                                    TABLE 2.9
                                    Sample Data
                                    Data Class            Change
                                    False                    No
                                    False                    No
                                    True                     Yes
                                    False                    Yes
                                    True                     Yes
                                    False                    No
                                    False                    No
                                    True                     Yes
                                    True                     No
                                    False                    No
                                    False                    No
                                    True                     Yes
                                    True                     No
                                    True                     Yes
                                    True                     Yes
                  TABLE 2.10
                  Contingency Table for Example Data Given in Table 2.9
                  Data Class       Change Present    Change Not Present    Total
                  True                      6                2               8
                  False                     1                6               7
                  Total                     7                8              15
                          RR = 0.75 / 0.142 = 5.282

                          Odds 1 = 6 / 2 = 3,    Odds 2 = 1 / 6 = 0.17

                          OR = 3 / 0.17 = 17.647

                          Risk difference = Risk 1 − Risk 2 = 0.75 − 0.142 = 0.608
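For illustration, the effect measures for a 2 × 2 contingency table can be computed with a small Python function; it reproduces Example 2.1 (the exact values differ slightly from those above, which round the risks and odds before dividing):

    def two_by_two_effects(a11, a12, a21, a22):
        # Rows: group 1 and group 2; columns: outcome present, outcome absent.
        risk1 = a11 / (a11 + a12)
        risk2 = a21 / (a21 + a22)
        return {
            "RR": risk1 / risk2,
            "OR": (a11 / a12) / (a21 / a22),
            "risk difference": risk1 - risk2,
        }

    # Data from Table 2.10: 6 of 8 data-intensive classes changed; 1 of 7 other classes changed.
    print(two_by_two_effects(6, 2, 1, 6))   # RR = 5.25, OR = 18.0, risk difference ≈ 0.607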
                          TABLE 2.11
                          Results of Five Studies
                                                             Standard
                          Study                     AUC        Error          95% CI
studies, whereas in the random effects model there are varied effects in the studies. When
heterogeneity is found in the effects, then the random effects model is preferred.
  Table 2.11 presents the AUC computed from the ROC analysis, standard error, and upper
bound and lower bound of CI. Figure 2.4 depicts the forest plot for five studies using AUC
and standard error. Each line represents each study in the SR. The boxes (black-filled
squares) depict the weight assigned to each study. The weight is based on the inverse of the
standard error: the smaller the standard error, the more weight is assigned to the study.
Hence, in general, weights can be based on the standard error and the sample size. The CI
is represented by the length of each line. The diamond summarizes the combined effect of
all the studies, and the edges of the diamond represent the overall effect. The results show
the presence of heterogeneity; hence, the random effects model is used to analyze the
overall accuracy, with AUC ranging from 0.69 to 0.85.
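For illustration, a minimal Python sketch of inverse-variance pooling is given below; the AUC values and standard errors are hypothetical, and a random effects model would additionally add a between-study variance component to each weight:

    # Fixed-effect, inverse-variance pooling (a common convention; the chapter describes
    # weights based on the standard error). All study values below are hypothetical.
    studies = [
        ("Study 1", 0.78, 0.04),
        ("Study 2", 0.71, 0.06),
        ("Study 3", 0.85, 0.03),
        ("Study 4", 0.69, 0.08),
        ("Study 5", 0.80, 0.05),
    ]

    weights = [1 / se**2 for _, _, se in studies]             # smaller SE -> larger weight
    pooled = sum(w * auc for (_, auc, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)   # 95% confidence interval

    print(f"Pooled AUC = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")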
FIGURE 2.4
Forest plots for Studies 1 through 5.
FIGURE 2.5
Funnel plot (standard error versus area under the curve).
FIGURE 2.6
Search process: a basic search of digital libraries (e.g., IEEE Xplore, ScienceDirect, ACM Digital Library), selection of the relevant studies, the resulting initial studies, and reference scanning.
the reference section of the relevant papers must also be included. Multiple copies of the
same publication must be removed, and the collected publications must be stored in a
reference management system such as Mendeley or JabRef. A list of the journals and
conferences in which the primary studies have been published must be created.
Table 2.12 shows some popular journals and conferences in software engineering.
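For illustration, a simple Python sketch removes duplicate records collected from several digital libraries by normalizing titles before the records are loaded into a reference manager; the records below are hypothetical:

    # Deduplicate collected publications by a normalized title key.
    records = [
        {"title": "Toward Comprehensible Software Fault Prediction Models", "source": "IEEE Xplore"},
        {"title": "toward comprehensible software fault prediction models ", "source": "Google Scholar"},
        {"title": "A Hypothetical Study of Fault Proneness", "source": "ScienceDirect"},
    ]

    seen = set()
    unique_records = []
    for record in records:
        key = " ".join(record["title"].lower().split())   # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique_records.append(record)

    print(len(unique_records), "unique studies")   # 2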
         TABLE 2.12
         Popular Journals and Conferences on Software Engineering
         Publication Name                                                                   Type
Table 2.13 shows an example of a data extraction form completed for the SRML case study
using the research results given by Dejaeger et al. (2013). A similar form can be made for all
the primary studies.
TABLE 2.13
Example Data Extraction Form
Section I
Reviewer name                             Ruchika Malhotra
Author name                               Karel Dejaeger, Thomas Verbraken, and Bart Baesens
Title of publication                      Toward Comprehensible Software Fault Prediction Models Using
                                           Bayesian Network Classifiers
Year of publication                       2013
Journal/conference name                   IEEE Transactions on Software Engineering
Type of the study                         Research paper
Section II
Data set used                               NASA data sets (JM1, MC1, KC1, PC1, PC2, PC3, PC4, PC5), Eclipse
Independent variables                       Static code measures (Halstead and McCabe)
Feature subselection method                 Markov Blanket
ML techniques used                          Naïve Bayes, Random Forest
Performance measures used                   AUC, H-measure
Values of accuracy measures (AUC)           Data              RF                    NB
                                            JM1              0.74         0.74      0.69        0.69
                                            KC1              0.82         0.8       0.8         0.81
                                            MC1              0.92         0.92      0.81        0.79
                                            PC1              0.84         0.81      0.77        0.85
                                            PC2              0.73         0.66      0.81        0.79
                                            PC3              0.82         0.78      0.77        0.78
                                            PC4              0.93         0.89      0.79        0.8
                                            PC5              0.97         0.97      0.95        0.95
                                            Ecl 2.0a         0.82         0.82      0.8         0.79
                                            Ecl 2.1a         0.75         0.73      0.74        0.74
                                            Ecl 3.0a         0.77         0.77      0.76        0.86
Strengths (Naïve Bayes)                     It is easy to interpret and construct
                                            Computationally efficient
Weaknesses (Naïve Bayes)                    Performance of model is dependent on attribute selection
                                              technique used
                                            Unable to discard irrelevant attributes
                           TABLE 2.14
                           Distribution of Studies Across ML Techniques
                           Based on Classification
                           Method                     # of Studies      Percent
FIGURE 2.7
Primary study distribution according to the metrics used: procedural 47%, OO 31%, miscellaneous 15%, and hybrid 7%.
the results section by using visual diagrams and tables. For example, Table 2.14 presents
the number of studies covering various ML techniques. There are various ML techniques
available in the literature, such as decision trees, NNs, support vector machines, and
Bayesian learning. The table shows that 31 studies analyzed decision tree techniques,
17 studies analyzed NN techniques, 18 studies examined support vector machines, and
so on. Similarly, the software metrics are divided into various categories in the SRML case
study: OO, procedural, hybrid, and miscellaneous. Figure 2.7 depicts the percentage of
studies examining each category of metrics; for example, 31% of the studies examine OO
metrics. The pie chart shows that procedural metrics are the most commonly used,
covering 47% of the total primary studies.
  The results of the ML techniques that were assessed in at least 5 of the 64 selected
primary studies are reported using the performance measures most frequently used in
those studies. The results showed that accuracy, F-measure, precision, recall,
and AUC are the most frequently used performance measures in the selected primary
studies. Tables 2.15 and 2.16 present the minimum, maximum, mean, median, and SD
values for the selected performance measures. The results are shown for RF and NN
techniques.
                    TABLE 2.15
                    Results of RF Technique
                    RF           Accuracy     Precision      Recall   AUC    Specificity
                   TABLE 2.16
                   Results of NN Technique
                   MLP          Accuracy      Precision     Recall     ROC      Specificity
     • Journal or conferences
     • Technical report
     • PhD thesis
The detailed reporting of the results of the SR is critical so that academicians can judge the
quality of the study. The detailed report consists of the review protocol, inclusion/exclusion
criteria, list of primary studies, list of rejected studies, quality scores assigned to the studies,
and raw data pertaining to the primary studies (for example, the number of research
questions addressed by each primary study). The review results are generally longer than a
normal original study; however, journals may not permit publication of a long SR. Hence, the
details may be kept in an appendix and stored in electronic form. The details may also be
published online in the form of a technical report.
  Table 2.17 presents the format and contents of the SR. The table provides the contents
along with its detailed description. The strengths and limitations of the SR must also be
discussed along with the explanation of its effect on the findings.
TABLE 2.17
Format of an SR Report
Section      Subsections             Description                                 Comments
Title                      –                                 The title should be short and informative.
Authors                    –                                 –
 Details
Abstract     Background    What is the relevance and          It allows the researchers to gain insight about the
                            importance of the SR?               importance, addressed areas, and main findings
             Method        What are the tools and techniques of the study.
                            used to perform the SR?
             Results       What are the major findings
                            obtained by the SR?
             Conclusions   What are the main implications
                            of the results and guidelines for
                            the future research?
TABLE 2.18
Systematic Reviews in Software Engineering (Authors, Year, Research Topics, Study Size,
QA Used, Data Synthesis Methods, Conclusions)

Kitchenham et al. (2007). Research topics: cost estimation models, cross-company data,
within-company data. Study size: 10. QA used: Yes. Data synthesis methods: tables. Conclusions:
   • Strict quality control on data collection is not sufficient to ensure that a cross-company
     model performs as well as a within-company model.
   • Studies where within-company predictions were better than cross-company predictions
     employed smaller within-company data sets, a smaller number of projects in the
     cross-company models, and smaller databases.

Jørgensen and Shepperd (2007). Research topics: cost estimation. Study size: 304. QA used: No.
Data synthesis methods: tables. Conclusions:
   • Increase the breadth of the search for relevant studies.
   • Search manually for relevant papers within a carefully selected set of journals.
   • Conduct more studies on the estimation methods commonly used by the software industry.
   • Increase the awareness of how properties of the data sets impact the results when
     evaluating estimation methods.

Stol et al. (2009). Research topics: open source software (OSS)-related empirical research.
Study size: 63. QA used: No. Data synthesis methods: pie charts, bar charts, tables. Conclusions:
   • Most research is done on OSS communities.
   • Most studies investigate projects in the “system” and “internet” categories.
   • Among the research methods used, case study, survey, and quantitative analysis are the
     most popular.
Exercises
     2.1 What is an SR? Why do we need to perform an SR?
     2.2 a. Discuss the advantages of SRs.
        b. Differentiate between a survey and an SR.
     2.3 Explain the characteristics and importance of SRs.
                        TABLE 2.12.1
                        Contingency Table from a Study on Change Prediction
                                       Change Prone   Not Change Prone   Total
                        Coupled             14               12            26
                        Not coupled         16               22            38
                        Total               30               34            64
  2.4 a. What are the search strategies available for selecting primary studies? How will
     you select the digital portals for searching primary studies?
      b. What are the criteria for forming a search string?
   2.5 What are the criteria for determining the number of researchers needed to conduct the
      same steps in an SR?
  2.6 What is the purpose of quality assessment criteria? How will you construct the
     quality assessment questions?
   2.7 Why is identification of the need for an SR considered the most important step in
      planning the review?
  2.8 How will you decide on the tools and techniques to be used during the data
     synthesis?
   2.9 What is publication bias? Explain the purpose of funnel plots in detecting
      publication bias.
  2.10 Explain the steps in SRs with the help of an example case study.
  2.11 Define the following terms:
     a. RR
     b. OR
     c. Risk difference
     d. Standardized mean difference
     e. Mean difference
   2.12 Given the contingency table for all classes that are coupled or not coupled in a
      software system with respect to the dichotomous variable change proneness, calculate
      the RR, OR, and risk difference (Table 2.12.1).
Further Readings
A classic study that describes empirical results in software engineering is given by:
A detailed survey that summarizes approaches that mine software repositories in the
context of software evolution is given in:
The guidelines for preparing the review protocols are given in:
For further understanding on forest and funnel plots, see the following publications:
  M. Shepperd, D. Bowes, and T. Hall, “Researcher bias: The use of machine learning
    in software defect prediction,” IEEE Transactions on Software Engineering, vol. 40,
    no. 6, pp. 603–616, 2014.
3
Software Metrics
Software metrics are used to assess the quality of the product or process used to build it.
The metrics allow project managers to gain insight about the progress of software and
assess the quality of the various artifacts produced during software development. The
software analysts can check whether the requirements are verifiable or not. The metrics
allow management to obtain an estimate of cost and time for software development. The
metrics can also be used to measure customer satisfaction. The software testers can mea-
sure the faults corrected in the system, which helps in deciding when to stop testing.
  Hence, the software metrics are required to capture various software attributes at differ-
ent phases of the software development. Object-oriented (OO) concepts such as coupling,
cohesion, inheritance, and polymorphism can be measured using software metrics. In this
chapter, we describe the measurement basics, software quality metrics, OO metrics, and
dynamic metrics. We also provide practical applications of metrics so that good-quality
systems can be developed.
3.1 Introduction
Software metrics can be used to adequately measure various elements of the software
development life cycle. The metrics can be used to provide feedback on a process or tech-
nique so that better or improved strategies can be developed for future projects. The qual-
ity of the software can be improved using the measurements collected by analyzing and
assessing the processes and techniques being used.
  The metrics can be used to answer various questions that arise during software development.
Such questions can be addressed by gathering information using metrics. The infor-
mation will allow software developer, project manager, or management to assess, improve,
and control software processes and products during the software development life cycle.
Software metrics should be collected
from the initial phases of software development to measure the cost, size, and effort of the
project. Software metrics can be used to ascertain and monitor the progress of the soft-
ware throughout the software development life cycle.
The information gathered also helps in making effective decisions. The effective application of metrics can improve the quality
of the software and produce software within the budget and on time. The contributions of
software metrics in building good-quality system are provided in Section 3.9.1.
FIGURE 3.1
Steps in software measurement.
determines the numerical relations corresponding to the empirical relations. In the next
step, real-world entities are mapped to numerical values, and in the last step, we deter-
mine whether the numerical relations preserve the empirical relations.
     1. Process: The process is defined as the way in which the product is developed.
     2. Product: The final outcome of following a given process or a set of processes is
        known as a product. The product includes documents, source codes, or artifacts
        that are produced during the software development life cycle.
A process uses the products produced by other activities, and it in turn produces products that
can be used by other activities. For example, the software design document is an artifact
produced from the design phase, and it serves as an input to the implementation phase. The
effectiveness of the processes followed during software development is measured using the
process metrics. The metrics related to products are known as product metrics. The effi-
ciency of the products is measured using the product metrics.
  The process metrics can be used to measure the cost, effort, and effectiveness of the activities carried out during software development.
For example, the effectiveness of the inspection activity can be measured by computing
costs and resources spent on it and the number of defects detected during the inspection
activity. By assessing whether the number of faults found outweighs the costs incurred
during the inspection activity, the project managers can judge the effectiveness of the
inspection activity.
  The product metrics are used to measure the effectiveness of deliverables produced dur-
ing the software development life cycle. For example, size, cost, and effort of the deliver-
ables can be measured. Similarly, documents produced during the software development
(SRS, test plans, user guides) can be assessed for readability, usability, understandability,
and maintainability.
  The process and product metrics can further be classified as capturing internal or external
attributes. The internal attributes concern the internal structure of the process or product. The
common internal attributes are size, coupling, and complexity. The external attributes concern
the behavioral aspects of the process or product. External attributes such as testability,
understandability, maintainability, and reliability can be measured using the process or prod-
uct metrics.
  The difference between attributes and metrics is that metrics are used to measure a
given attribute. For example, size is an attribute that can be measured through lines of
source code (LOC) metric.
  The internal attributes of a process or product can be measured without executing the
source code; examples of internal attributes are the number of paths, the number of branches,
coupling, and cohesion. External attributes include quality attributes of the
system. They can be measured by executing the source code, such as the number of failures,
response time, and navigation easiness of an item. Figure 3.2 presents the categories of
software metrics with examples at the lowest level in the hierarchy.
FIGURE 3.2
Categories of software metrics (process metric examples: failure rate found in reviews, number
of issues, effectiveness of a method; product metric examples: size, inheritance, coupling,
reliability, maintainability, usability).
                                   TABLE 3.1
                                   Example of Metrics Having Continuous Scale
                                   Class#              LOC Added               LOC Deleted
                                   A                       34                            5
                                   B                       42                           10
                                   C                       17                            9
heavier than B with weight 100 pounds. Simple counts are represented by the absolute scale.
Examples of simple counts are the number of faults, LOC, and the number of methods. For the
absolute scale, descriptive statistics such as the mean, median, and standard deviation
can be applied to summarize the data.
  Nonmetric type of data can be measured on nominal or ordinal scales. Nominal scale
divides a metric into classes, categories, or levels without considering any order or rank
between these classes. For example, a change is either present or not present in a class.
Another example of a nominal scale is programming language, where the language names serve
as labels for different categories. On an ordinal scale, one category can be compared with another
in terms of a “higher than,” “greater than,” or “lower than” relationship. For example, the overall
navigational capability of a web page can be ranked into various categories as shown below:
          What is the overall navigational capability of a webpage?
               1 = excellent
               2 = good
               3 = medium
               4 = bad
               5 = worst
TABLE 3.2
Summary of Measurement Scales
Measurement Scale     Characteristics     Statistics     Operations     Transformation     Examples
     Example 3.1
     Consider the count of number of faults detected during inspection activity:
       1. What is the measurement scale for this definition?
       2. What is the measurement scale if number of faults is classified between 1 and
          5, where 1 means very high, 2 means high, 3 means medium, 4 means low, and
          5 means very low?
     Solution:
        1. The measurement scale of the number of faults is absolute as it is a simple count
           of values.
        2. Now, the measurement scale is ordinal since the variable has been converted
           to be categorical (consists of classes), involving ranking or ordering among
           categories.
     A line of code is any line of program text that is not a comment or blank line, regardless
     of the number of statements or fragments of statements on the line. This specifically
     includes all lines containing program headers, declarations, and executable and non-
     executable statements.
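As a rough illustration of this definition, the following C++ sketch counts lines of code in a
simplified way: blank lines and lines containing only a single-line (//) comment are excluded.
Block comments are not handled, so the count only approximates the definition given above; the
file name used is hypothetical.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Returns true if the line contains only whitespace.
    static bool isBlank(const std::string& line) {
        return line.find_first_not_of(" \t\r") == std::string::npos;
    }

    // Returns true if the first non-whitespace characters start a // comment.
    static bool isLineComment(const std::string& line) {
        std::string::size_type pos = line.find_first_not_of(" \t\r");
        return pos != std::string::npos && line.compare(pos, 2, "//") == 0;
    }

    // Counts the lines that are neither blank nor pure line comments.
    int countLOC(const std::string& fileName) {
        std::ifstream in(fileName.c_str());
        std::string line;
        int loc = 0;
        while (std::getline(in, line)) {
            if (!isBlank(line) && !isLineComment(line))
                ++loc;
        }
        return loc;
    }

    int main() {
        std::cout << "LOC = " << countLOC("example.cpp") << std::endl;
        return 0;
    }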
In OO software development, the size of software can be calculated in terms of classes and
the attributes and functions included in the classes. The details of OO size metrics can be
found in Section 3.5.6.
FIGURE 3.3
Operation to find greatest among three numbers.
	                                 Defect density = Number of defects / KLOC
The number of defects measure counts the defects detected during testing or by using any
verification technique.
   Defect rate can be measured as the defects encountered over a period of time, for instance
per month. The defect rate may be useful in predicting the cost and resources that will be
utilized in the maintenance phase of software development. Defect density during testing
is another effective metric that can be used during formal testing. It measures the defect
density during the formal testing after completion of the source code and addition of the
source code to the software library. If the value of defect density metric during testing is
high, then the tester should ask the following questions:
     • Does the software contain a large number of latent defects?
     • Is the testing process highly effective, detecting most of the defects that are present?
If the reason for the high number of defects is the first one, then the software should be thor-
oughly tested to detect the high number of defects. However, if the reason for high number
of defects is the second one, it implies that the quality of the system is good because of the
presence of fewer defects.
	                                 DRE = DB / (DB + DA)
where:
 DB depicts the defects encountered before software delivery
 DA depicts the defects encountered after software delivery
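For example, using hypothetical figures, if 90 defects are detected before delivery and 10 defects
are reported after delivery, then DRE = 90/(90 + 10) = 0.9; that is, 90% of the defects were
removed before the software was delivered.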
Quantity and quality measures are expressed in percentages. For example, consider the
problem of proofreading an eight-page document. Quantity is defined as the percentage of
proofread words, and quality is defined as the percentage of the correctly proofread docu-
ment. Suppose quantity is 90% and quality is 70%; then task effectiveness = 0.90 × 0.70 = 63%.
  The other measures of usability defined in the MUSIC project are (Bevan 1995):

	                 Temporal efficiency = Effectiveness / Task time

	                 Productive period = ((Task time − Unproductive time) / Task time) × 100

	                 Relative user efficiency = (User efficiency / Expert efficiency) × 100
There are various measures that can be used to assess the usability aspect of the system;
some of them are given below:
     • Can the user easily learn the interface paths in a webpage?
     • Are the interface titles understandable?
     • Can the topics be found easily in the ‘help’?
The charts, such as bar charts, pie charts, scatter plots, and line charts, can be used to
depict and assess the satisfaction level of the customer. The satisfaction level of the cus-
tomer must be continuously monitored over time.
NASA developed a test focus (TF) metric defined as the ratio of the amount of effort spent
in finding and removing “real” faults in the software to the total number of faults reported
in the software. The TF metric is given as (Stark et al. 1992):
On the basis of the above direct measures, additional testing-related metrics can be computed
to derive more useful information from the basic metrics.
3.5 OO Metrics
Because of growing size and complexity of software systems in the market, OO analysis
and design principles are being used by organizations to produce better designed, high-
quality, and maintainable software. As the systems are being developed using OO soft-
ware engineering principles, the need for measuring various OO constructs is increasing.
   Features of OO paradigm (programming languages, tools, methods, and processes) pro-
vide support for many quality attributes. The key concepts of OO paradigm are: classes,
objects, attributes, methods, modularity, encapsulation, inheritance, and polymorphism
(Malhotra 2009). An object is made up of three basic components: an identity, a state, and a
behavior (Booch 1994). The identity distinguishes two objects with the same state and behav-
ior. The state of the object represents the different possible internal conditions that the
object may experience during its lifetime. The behavior of the object is the way the object
will respond to a set of received messages.
   A class is a template consisting of a number of attributes and methods. Every object
is the instance of a class. The attributes in a class define the possible states in which an
instance of that class may be. The behavior of an object depends on the class methods and
the state of the object as methods may respond differently to input messages depending on
the current state. Attributes and methods are said to be encapsulated into a single entity.
Encapsulation and data hiding are key features of OO languages.
   The main advantage of encapsulation is that the values of attributes remain private,
unless the methods are written to pass that information outside of the object. The internal
working of each object is decoupled from the other parts of the software thus achieving
modularity. Once a class has been written and tested, it can be distributed to other pro-
grammers for reuse in their own software. This is known as reusability. The objects can
be maintained separately, leading to easier location and fixing of errors. This property is
known as maintainability.
   The most powerful technique associated with OO methods is the inheritance relationship.
If a class B is derived from class A, then class A is said to be a base (or super) class and class B
is said to be a derived (or sub) class. A derived class inherits all the behavior of its base class
and is allowed to add its own behavior.
   Polymorphism (another useful OO concept) describes multiple possible states for a
single property. Polymorphism allows programs to be written based only on the abstract
interfaces of the objects, which will be manipulated. This means that future extension
in the form of new types of objects is easy, if the new objects conform to the original
interface.
TABLE 3.3
Chidamber and Kemerer Metric Suites
Metric                                 Definition                                Construct Being Measured
CBO         It counts the number of other classes to which a class is linked.           Coupling
WMC         It counts the number of methods weighted by complexity in a class.          Size
RFC         It counts the number of external and internal methods in a class.           Coupling
LCOM        Lack of cohesion in methods                                                 Cohesion
NOC         It counts the number of immediate subclasses of a given class.              Inheritance
DIT         It counts the number of steps from the leaf to the root node.               Inheritance
TABLE 3.4
Li and Henry Metric Suites
Metric                                 Definition                                Construct Being Measured
                  TABLE 3.5
                  Lorenz and Kidd Metric Suite for Measuring Inheritance
                  Metric                                     Definition
                  NOP          It counts the number of immediate parents of a given class.
                  NOD          It counts the number of indirect and direct subclasses of a given class.
                  NMO          It counts the number of methods overridden in a class.
                  NMI          It counts the number of methods inherited in a class.
                  NMA          It counts the number of new methods added in a class.
                  SIX          Specialization index
TABLE 3.6
Briand et al. Metric Suite
Metrics: IFCAIC, ACAIC, OCAIC, FCAEC, DCAEC, OCAEC, IFCMIC, ACMIC, DCMIC, FCMEC, DCMEC,
OCMEC, IFMMIC, AMMIC, OMMIC, FMMEC, DMMEC, OMMEC
These coupling metrics count the number of interactions between classes. The metrics distinguish
the relationship between the classes (friendship, inheritance, none), the different types of
interactions, and the locus of impact of the interaction. The acronym of each metric indicates
what interactions are counted:
   • The first one or two characters indicate the type of coupling relationship between classes:
     A: ancestor, D: descendants, F: friend classes, IF: inverse friends (classes that declare a
     given class A as their friend), O: others, which implies none of the other relationships.
   • The next two characters indicate the type of interaction:
     CA: there is a class–attribute interaction if class x has an attribute of type class y.
     CM: there is a class–method interaction if class x has a method with a parameter of type class y.
     MM: there is a method–method interaction if class x calls a method of another class y, or
     class x has a method of class y as a parameter.
   • The last two characters indicate the locus of impact:
     IC: import coupling, which counts the number of other classes called by class x.
     EC: export coupling, which counts the number of other classes using class y.
         TABLE 3.7
         Lee et al. Metric Suites
         Metric                               Definition                           Construct Being Measured
  Yap and Henderson-Sellers (1993) have proposed a suite of metrics to measure cohesion
and reuse in OO systems. Aggarwal et al. (2005) defined two reusability metrics namely
function template factor (FTF) and class template factor (CTF) that are used to mea-
sure reuse in OO systems. The relevant metrics summarized in tables are explained in
subsequent sections.
                  TABLE 3.8
                  Benlarbi and Melo Polymorphism Metrics
                  Metric                               Definition
                  SPA          It measures static polymorphism in ancestors.
                  DPA          It measures dynamic polymorphism in ancestors.
                  SP           It is the sum of SPA and SPD metrics.
                  DP           It is the sum of DPA and DPD metrics.
                  NIP          It measures polymorphism in noninheritance relations.
                  OVO          It measures overloading in stand-alone classes.
                  SPD          It measures static polymorphism in descendants.
                  DPD          It measures dynamic polymorphism in descendants.
Figure 3.4 depicts the values of the fan-in and fan-out metrics for classes A, B, C, D, E, and F of
an example system. The value of fan-out should be kept as low as possible, because a high
fan-out increases the complexity and the maintenance effort of the software.
FIGURE 3.4
Fan-in and fan-out metrics of an example system (e.g., class A has fan-out = 4; class F has
fan-in = 2 and fan-out = 0).
This definition also includes coupling based on inheritance. Chidamber and Kemerer
(1994) defined coupling between objects (CBO) as “the count of number of other classes
to which a class is coupled.” The CBO definition given in 1994 includes inheritance-based
coupling. For example, consider Figure 3.5: three variables of other classes (class B, class C,
and class D) are used in class A; hence, the value of CBO for class A is 3. Similarly, classes
D, F, G, and H have a CBO value of zero.
   Li and Henry (1993) used data abstraction technique for defining coupling. Data abstrac-
tion provides the ability to create user-defined data types called abstract data types (ADTs).
Li and Henry defined data abstraction coupling (DAC) as:
                            DAC = number of ADTs defined in a class
In Figure 3.5, class A has three ADTs (i.e., three nonsimple attributes). Li and Henry defined
another coupling metric known as message passing coupling (MPC) as “number of unique
send statements in a class.” Hence, if three different methods in class B access the same
method in class A, then MPC is 3 for class B, as shown in Figure 3.6.
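The following minimal C++ sketch mirrors this situation (the class and method names are
illustrative rather than taken from the figure). Each call site counts as a separate send
statement, so MPC(B) = 3 even though the same method of class A is invoked every time.

    class A {
    public:
        void methodA1() {}
    };

    class B {
    public:
        // Each method below contains one send statement to class A.
        void methodB1(A& a) { a.methodA1(); }   // send statement 1
        void methodB2(A& a) { a.methodA1(); }   // send statement 2
        void methodB3(A& a) { a.methodA1(); }   // send statement 3
    };

    int main() { return 0; }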
  Chidamber and Kemerer (1994) defined response for a class (RFC) metric as a set of
methods defined in a class and called by a class. It is given by RFC = |RS|, where RS, the
response set of the class, is given by:
	                                 RS = {M} ∪ all i {Ri}
where:
  {M} = set of all methods in the class
  {Ri} = set of methods called by method Mi
FIGURE 3.5
Values of the CBO metric for a small program (e.g., class A: fan-out = 3, CBO = 3; class B:
fan-out = 2, CBO = 2; class C: fan-out = 1, CBO = 1).
FIGURE 3.6
Example of MPC metric (classes A and B).
FIGURE 3.7
Example of RFC metric (class A with methods MethodA1(), MethodA2(), and MethodA3(); class B
with methods MethodB1() and MethodB2(); and class C with methods MethodC1() and MethodC2()).
For example, in Figure 3.7, the RFC value for class A is 6, as class A has three methods of its
own and calls two methods of class B and one method of class C.
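A minimal C++ sketch matching this description (method names are illustrative) is given below;
the response set of class A contains its own three methods plus the three external methods it calls.

    class B {
    public:
        void methodB1() {}
        void methodB2() {}
    };

    class C {
    public:
        void methodC1() {}
    };

    class A {
        B b;
        C c;
    public:
        void methodA1() { b.methodB1(); }   // calls a method of class B
        void methodA2() { b.methodB2(); }   // calls another method of class B
        void methodA3() { c.methodC1(); }   // calls a method of class C
    };

    // RS(A) = {methodA1, methodA2, methodA3, B::methodB1, B::methodB2, C::methodC1},
    // so RFC(A) = |RS(A)| = 6.
    int main() { return 0; }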
  A number of coupling metrics with respect to OO software have been proposed by
Briand et al. (1997). These metrics take into account the different OO design mechanisms
provided by the C++ language: friendship, classes, specialization, and aggregation. These
metrics may be used to guide software developers about which type of coupling affects
the maintenance cost and reduces reusability. Briand et al. (1997) observed that the cou-
pling between classes could be divided into different facets:
The metrics for CM interaction type are IFCMIC, ACMIC, OCMIC, FCMEC, DCMEC, and
OCMEC. In these metrics, the first one/two letters denote the type of relationship (IF denotes
inverse friendship, A denotes ancestors, D denotes descendant, F denotes friendship, and O
denotes others). The next two letters denote the type of interaction (CA, CM, MM) between
classes. Finally, the last two letters denote the type of coupling (IC or EC).
  Lee et al. (1995) acknowledged the need to differentiate between inheritance-based and
noninheritance-based coupling by proposing the corresponding measures: noninheritance
information flow-based coupling (NIH-ICP) and information flow-based inheritance coupling
(IH-ICP). Information flow-based coupling (ICP) metric is defined as the sum of NIH-ICP and
IH-ICP metrics and is based on method invocations, taking polymorphism into account.
 class A
 {
 B B1; // Nonsimple attributes
 C C1;
 public:
 void M1(B B1)
 {
 }
 };
 class B
 {
 public:
 void M2()
 {
 A A1;
 A1.M1(*this); // Method of class A called
 }
 };
 class C
 {
 void M3(void (B::*m)()) //Method of class B passed as parameter
 {
 }
 };
FIGURE 3.8
Example for computing type of interaction.
	                        LCOM1 = ((1/n) ∑ µ(Di) − m) / (1 − m)
where the sum runs over the n attributes Di of the class, µ(Di) denotes the number of methods
that access attribute Di, and m is the number of methods in the class.
FIGURE 3.9
Example of LCOM metric.
FIGURE 3.10
Stack class (attributes: top : Integer, a : Integer; methods: Push(a, n), Pop(), Getsize(),
Empty(), Display()).
The approach by Bieman and Kang (1995) to measure cohesion was based on that of
Chidamber and Kemerer (1994). They proposed two cohesion measures—TCC and LCC.
TCC metric is defined as the percentage of pairs of directly connected public methods
of the class with common attribute usage. LCC is the same as TCC, except that it also
considers indirectly connected methods. A method M1  is indirectly connected with
method  M3, if  method M1  is connected to method M2  and method M2  is connected
to method M3. Hence, transitive closure of directly connected methods is represented by
indirectly connected methods. Consider the class stack shown in Figure 3.10.
  Figure 3.11 shows the attribute usage of methods. The pair of public functions with com-
mon attribute usage is given below:
    {(empty, push), (empty, pop), (empty, display), (getsize, push), (getsize, pop), (push, pop),
    (push, display), (pop, display)}
Since the class has five public methods, there are (5 × 4)/2 = 10 possible pairs of methods, of
which 8 share common attribute usage. Hence:
	                                 TCC(Stack) = (8/10) × 100 = 80%
The methods “empty” and “getsize” are indirectly connected, since “empty” is connected
to “push” and “getsize” is also connected to “push.” Thus, by transitivity, “empty” is con-
nected to “getsize.” Similarly “getsize” is indirectly connected to “display.”
LCC for stack class is as given below:
	                                 LCC(Stack) = (10/10) × 100 = 100%
FIGURE 3.11
Attribute usage of methods of class stack.
Lee et al. (1995) proposed information flow-based cohesion (ICH) metric. ICH for a class is
defined as the weighted sum of the number of invocations of other methods of the same
class, weighted by the number of parameters of the invoked method.
	                        AID = (∑ depth of each class) / (Total number of classes)
FIGURE 3.12
Inheritance hierarchy (classes A, B, C, D, E, F, and G).
parent class. NMA counts the number of new methods (neither overridden nor inherited)
added in a class. NMI counts the number of methods inherited by a class from its parent class.
Finally, Lorenz and Kidd (1994) defined specialization index (SIX) using DIT, NMO, NMA,
and NMI metrics as given below:
	                                 SIX = (NMO × DIT) / (NMO + NMA + NMI)
Consider the class diagram given in Figure 3.13. The class Employee inherits class Person.
The class Employee overrides two functions, addDetails() and display(). Thus, the value of the
NMO metric for class Employee is 2. Two new methods are added in this class (getSalary() and
compSalary()). Hence, the value of the NMA metric is 2.
  Thus, for class Employee, the value of NMO is 2, NMA is 2, and NMI is 1 (getEmail()).
For the class Employee, the value of SIX is:
	                                 SIX = (2 × 1) / (2 + 2 + 1) = 2/5 = 0.4
The maximum number of levels in the inheritance hierarchy that are below the class are
measured through class to leaf depth (CLD). The value of CLD for class Person is 1.
FIGURE 3.13
Example of inheritance relationship: class Person (attributes: name, phone, addr, email; methods:
addDetails(), display(), getEmail()) and its subclass Employee (attributes: Emp_id, basic, da,
hra; methods: addDetails(), display(), getSalary(), compSalary()).
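A minimal C++ sketch of the hierarchy in Figure 3.13 (attribute declarations and method bodies
are omitted) makes the inheritance metric values easy to trace:

    class Person {
    public:
        virtual void addDetails() {}
        virtual void display() {}
        void getEmail() {}
        // attributes: name, phone, addr, email
    };

    class Employee : public Person {
    public:
        void addDetails() {}    // overridden method -> counted by NMO
        void display() {}       // overridden method -> counted by NMO
        void getSalary() {}     // new method        -> counted by NMA
        void compSalary() {}    // new method        -> counted by NMA
        // getEmail() is inherited unchanged -> NMI = 1
        // attributes: emp_id, basic, da, hra
    };

    // For class Employee: NMO = 2, NMA = 2, NMI = 1, and DIT = 1,
    // so SIX = (2 x 1) / (2 + 2 + 1) = 0.4.
    int main() { return 0; }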
requirements. Yap and Henderson-Sellers (1993) discuss two measures designed to evaluate
the level of reuse possible within classes. The reuse ratio (U) is defined as:
	                        U = Number of superclasses / Total number of classes
Consider Figure 3.13: the value of U is 1/2. Another metric is the specialization ratio (S),
given as:
	                        S = Number of subclasses / Number of superclasses
In Figure 3.13, Employee is the subclass and Person is the parent class. Thus, S = 1.
  Aggarwal et  al. (2005) proposed another set of metrics for measuring reuse by using
generic programming in the form of templates. The metric FTF is defined as the ratio of the num-
ber of functions using function templates to the total number of functions, as shown below:
	                        FTF = (∑ i=1 to n of uses_FT(Fi)) / (∑ i=1 to n of Fi)
where uses_FT(Fi) is 1 if function Fi is defined using a function template and 0 otherwise, and
the denominator counts the total number of functions. For the source code shown in Figure 3.14,
only method2 uses a function template, so FTF = 1/3.
    void method1(){
    .........}
    template<class U>
    void method2(U &a, U &b){
    .........}
    void method3(){
    ........}
FIGURE 3.14
Source code for calculation of FTF metric.
    class X{
    .....};
    template<class U, int size>
    class Y{
    U ar1[size];
    ....};
FIGURE 3.15
Source code for calculating metric CTF.
	                        CTF = (∑ i=1 to n of uses_CT(Ci)) / (∑ i=1 to n of Ci)
where uses_CT(Ci) is 1 if class Ci uses a class template and 0 otherwise, and the denominator
counts the total number of classes. In Figure 3.15, the value of the metric CTF = 1/2.
	                                 WMC = ∑ i=1 to n of Ci
where:
  M1, …, Mn are the methods defined in class K1
  C1, …, Cn are the complexities of the methods
Lorenz and Kidd defined number of attributes (NOA) metric given as the sum of number
of instance variables and number of class variables. Li and Henry (1993) defined number
of methods (NOM) as the number of local methods defined in a given class. They also
defined two other size metrics—namely, SIZE1 and SIZE2. SIZE1 counts the number of semicolons
in a class, and SIZE2 is the sum of the number of attributes and the number of local methods
defined in a class.
TABLE 3.9
Difference between Static and Dynamic Metrics
S. No.                      Static Metrics                                  Dynamic Metrics
TABLE 3.10
Mitchell and Power Dynamic Coupling Metric Suite
Metric                                                                 Definition
Dynamic coupling between objects                 This metric is same as Chidamber and Kemerer’s
                                                   CBO metric, but defined at runtime.
Degree of dynamic coupling between two classes   It is the percentage ratio of the number of times a class A
 at runtime                                        accesses the methods or instance variables of another
                                                   class B to the total number of accesses made by class A.
Degree of dynamic coupling within a given set    The metric extends the concept given by the above
 of classes                                        metric to indicate the level of dynamic coupling within
                                                   a given set of classes.
Runtime import coupling between objects          Number of classes accessed by a given class at runtime.
Runtime export coupling between objects          Number of classes that access a given class at runtime.
Runtime import degree of coupling                Ratio of the number of classes accessed by a given class
                                                   at runtime to the total number of accesses made.
Runtime export degree of coupling                Ratio of the number of classes that access a given class
                                                   at runtime to the total number of accesses made.
  • LOC added: Sum total of all the lines of code added to a file for all of its revisions
    in the repository
  • Max LOC added: Maximum number of lines of code added to a file for all of its
    revisions in the repository
  • Average LOC added: Average number of lines of code added to a file for all of its
    revisions in the repository
  • LOC deleted: Sum total of all the lines of code deleted from a file for all of its revi-
    sions in the repository
  • Max LOC deleted: Maximum number of lines of code deleted from a file for all of
    its revisions in the repository
  • Average LOC deleted: Average number of lines of code deleted from a file for all of
    its revisions in the repository
3.7.4 Miscellaneous
The other related evolution metrics are:
     • Max change set: Maximum number of files that are committed or checked in
       together in a repository
     • Average change set: Average number of files that are committed or checked in
       together in a repository
     • Age: Age of repository file, measured in weeks by counting backward from a given
       release of a software system
     • Weighted Age: Weighted Age of a repository file is given as:
	        Weighted age = (∑ i=1 to N of Age(i) × LOC added(i)) / (∑ i=1 to N of LOC added(i))
        where:
          i is a revision of a repository file and N is the total number of revisions for that
               file
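A short C++ sketch (the structure and the revision data below are hypothetical, not mined from a
real repository) shows how the weighted age of a file could be computed from its revision history:

    #include <iostream>
    #include <vector>

    // One revision of a repository file, as mined from the version history.
    struct Revision {
        double ageWeeks;   // age of the revision in weeks, counted backward from a release
        int    locAdded;   // lines of code added in this revision
    };

    // Weighted age = sum(Age(i) * LOC added(i)) / sum(LOC added(i)).
    double weightedAge(const std::vector<Revision>& revisions) {
        double weightedSum = 0.0;
        double totalAdded = 0.0;
        for (const Revision& r : revisions) {
            weightedSum += r.ageWeeks * r.locAdded;
            totalAdded  += r.locAdded;
        }
        return totalAdded > 0.0 ? weightedSum / totalAdded : 0.0;
    }

    int main() {
        // Hypothetical revision history of a single file.
        std::vector<Revision> history = { {40.0, 100}, {10.0, 300} };
        std::cout << "Weighted age = " << weightedAge(history) << " weeks" << std::endl;   // 17.5
        return 0;
    }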
	
	                                 (∃P)(∃Q)  u(P) ≠ u(Q)
    It ensures that no measure rates all programs/classes to be of the same metric value.
    Property 2: Let c be a nonnegative number. Then, there are finite numbers of pro-
        gram/class with metric c. This property ensures that there is sufficient resolution
        in the measurement scale to be useful.
    Property 3: There are distinct programs/classes P and Q such that u(P) = u(Q).
    Property 4: For OO system, two programs/classes having the same functionality
        could have different values.
	
	                                 (∃P)(∃Q)  P ≡ Q and u(P) ≠ u(Q)
    Property 5: When two programs/classes are concatenated, their metric should be
      greater than the metrics of each of the parts.
	
	                      (∀P)(∀Q)  u(P) ≤ u(P + Q) and u(Q) ≤ u(P + Q)
    Property 6: This property suggests nonequivalence of interaction: there exist two
      program/class bodies of equal metric value that, when each is separately concatenated
      with the same third program/class, yield programs/classes of different metric values.
    For programs/classes P, Q, and R:
	
	                    (∃P)(∃Q)(∃R)  u(P) = u(Q) and u(P + R) ≠ u(Q + R)
    Property 7: This property is not applicable for OO metrics (Chidamber and Kemerer 1994).
    Property 8: It specifies that “if P is a renaming of Q, then u ( P ) = u ( Q ).”
    Property 9: This property is not applicable for OO metrics (Chidamber and Kemerer
      1994).
and control quality, it is very important to understand how the quality can be measured.
Software metrics are widely used for measuring, monitoring, and evaluating the quality
of a project. Various software metrics have been proposed in the literature to assess the
software quality attributes such as change proneness, fault proneness, maintainability of
a class or module, and so on. A large portion of empirical research has involved the development
and evaluation of quality models for procedural and OO software.
  Software metrics have found a wide range of applications in various fields of software engi-
neering. As discussed, some of the familiar and common uses of software metrics are sched-
uling the time required by a project, estimating the budget or cost of a project, estimating the
size of the project, and so on. These parameters can be estimated at the early phases of soft-
ware development life cycle, and thus help software managers to make judicious allocation
of resources. For example, once the schedule and budget has been decided upon, managers
can plan in advance the amount of person-hours (effort) required. Besides this, the design of
software can be assessed in the industry by identifying the out of range values of the software
metrics. One way to improve the quality of the system is to relate structural attribute mea-
sures intended to quantify important concepts of a given software, such as the following:
     •   Encapsulation
     •   Coupling
     •   Cohesion
     •   Inheritance
     •   Polymorphism
     •   Fault proneness
     •   Maintainability
     •   Testing effort
     •   Rework effort
     •   Reusability
     •   Development effort
The ability to assess quality of software in the early phases of the software life cycle is the
main aim of researchers so that structural attribute measures can be used for predicting exter-
nal attribute measures. This would greatly facilitate technology assessment and comparisons.
  Researchers are working hard to investigate the properties of software measures to
understand the effectiveness and applicability of the underlying measures. Hence, we need
to understand what these measures are really capturing, whether they are really differ-
ent, and whether they are useful indicators of the quality attributes of interest. This will build
a body of evidence, and present commonalities and differences across various studies.
Finally, these empirical studies will contribute largely in building good quality systems.
the popular and widely used software metrics suite available to measure the constructs
is identified from the literature. Finally, a decision on the selection must be made on soft-
ware metrics. The criteria that can be used to select software metrics are that the selected
software metrics must capture all the constructs, be widely used in the literature, be easily
understood, and be fast and computationally inexpensive to compute. The choice of metric
suite heavily depends on the goals of the research. For instance, in quality model pre-
diction, OO metrics proposed by Chidamber and Kemerer (1994) are widely used in the
empirical studies.
   In cases where multiple software metrics are used, the attribute reduction techniques
given in Section 6.2 must be applied to reduce them, if model prediction is being conducted.
to identify the threshold values of various OO metrics. Besides this, Shatnawi et al. (2010)
also investigated the use of receiver operating characteristics (ROCs) method to identify
threshold values. The detailed explanation of the above two methodologies is provided in
the subsections below (Shatnawi 2006). Malhotra and Bansal (2014a) evaluated the
threshold approach proposed by Bender (1999) for fault prediction.
	                        VARL = (1/β) (ln(Po / (1 − Po)) − α)
where:
 α is a constant
 β is the estimated coefficient
 Po is the acceptable risk level
In this formula, α and β are obtained using the standard logistic regression formula
(refer Section 7.2.1). This formula will be used for each metric individually to find its
threshold value.
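A short C++ sketch of this computation is given below. The α, β, and Po values are hypothetical;
in practice they would come from the univariate logistic regression results (e.g., Table 3.12)
and the chosen risk level.

    #include <cmath>
    #include <iostream>

    // VARL = (1/beta) * (ln(p0 / (1 - p0)) - alpha), as defined above.
    double varl(double alpha, double beta, double p0) {
        return (std::log(p0 / (1.0 - p0)) - alpha) / beta;
    }

    int main() {
        double alpha = -2.5;   // hypothetical constant from logistic regression
        double beta  =  0.1;   // hypothetical estimated coefficient
        double p0    =  0.1;   // acceptable risk level
        std::cout << "Threshold (VARL) = " << varl(alpha, beta, p0) << std::endl;
        return 0;
    }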
  For example, consider the following data set (Table A.8  in Appendix I) consisting of
the metrics (independent variables): LOC, DIT, NOC, CBO, LCOM, WMC, and RFC. The
dependent variable is fault proneness. We calculate the threshold values of all the metrics
using the following steps:
	                                 P = e^g(x) / (1 + e^g(x))
	                                 g(x) = α + βx
        where:
         x is the independent variable, that is, an OO metric
         α is the Y-intercept or constant
         β is the slope or estimated coefficient
     Table 3.11 shows the statistical significance (sig.) for each metric. The “sig.” parame-
       ter provides the association between each metric and fault proneness. If the “sig.”
                              TABLE 3.11
                              Statistical Significance of Metrics
                              Metric                 Significance
                              WMC                        0.013
                              CBO                        0.01
                              RFC                        0.003
                              LOC                        0.001
                              DIT                        0.296
                              NOC                        0.779
                              LCOM                       0.026
     value is below or at the significance threshold of 0.05, then the metric is said to
     be significant in predicting fault proneness (shown in bold). Only for significant
     metrics, we calculate the threshold values. It can be observed from Table 3.11 that
     DIT and NOC metrics are insignificant, and thus are not considered for further
     analysis.
  Step 2: Calculate the values of constant and coefficient for significant metrics.
  For significant metrics, the values of constant (α) and coefficient (β) using univariate
     logistic regression are calculated. These values of constant and coefficient will be
     used in the computation of threshold values. The coefficient shows the impact of
     the independent variable, and its sign shows whether the impact is positive or
     negative. Table 3.12 shows the values of constant (α) and coefficient (β) of all the
     significant metrics.
  Step 3: Computation of threshold values.
  We have calculated the threshold values (VARL) for the metrics that are found to be
     significant using the formula given above. The VARL values are calculated for
     different values of Po, that is, at different levels of risks (between Po = 0.01 and
     Po = 0.1). The threshold values at different values of Po (0.01, 0.05, 0.08, and 0.1)
     for all the significant metrics are shown in Table 3.13. It can be observed that the
     threshold values of all the metrics change significantly as Po changes. This shows
     that Po plays a significant role in calculating threshold values. Table 3.13 shows
     that at risk level 0.01 and 0.05, VARL values are out of range (i.e., negative values)
     for all of the metrics. At Po = 0.1, the threshold values are within the observation
     range of all the metrics. Hence, in this example, we say that Po = 0.1 is the appro-
     priate risk level and the threshold values (at Po = 0.1) of WMC, CBO, RFC, LOC,
     and LOCM are 17.99, 14.46, 52.37, 423.44, and 176.94, respectively.
                    TABLE 3.12
                    Constant (α) and Coefficient (β) of Significant Metrics
                    Metric             Coefficient (β)              Constant (α)
             TABLE 3.13
             Threshold Values on the basis of Logistic Regression Method
             Metrics    VARL at 0.01    VARL at 0.05     VARL at 0.08   VARL at 0.1
             WMC           −42.69           −15.17            −6.81         17.99
             CBO           −17.48            −2.99             1.41         14.46
             RFC           −61.41            −9.83             5.86         52.37
             LOC          −486.78           −74.11            51.41        423.44
             LCOM         −733.28          −320.61          −195.09        176.94
                            TABLE 3.14
                            Threshold Values on the Basis of the
                            ROC Curve Method
                            Metric            Threshold Value
                            WMC                       7.5
                            DIT                       1.5
                            NOC                       0.5
                            CBO                       8.5
                            RFC                        43
                            LCOM                     20.5
                            LOC                     304.5
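As an illustration of how such thresholds can be applied (the metric values of the class below are
hypothetical), a class can be flagged for closer review whenever one of its metric values exceeds
the ROC-based thresholds of Table 3.14:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // ROC-based threshold values taken from Table 3.14.
        std::map<std::string, double> threshold = {
            {"WMC", 7.5}, {"DIT", 1.5}, {"NOC", 0.5}, {"CBO", 8.5},
            {"RFC", 43.0}, {"LCOM", 20.5}, {"LOC", 304.5}
        };

        // Hypothetical metric values collected for one class.
        std::map<std::string, double> classMetrics = {
            {"WMC", 12}, {"DIT", 1}, {"NOC", 0}, {"CBO", 10},
            {"RFC", 35}, {"LCOM", 25}, {"LOC", 280}
        };

        // Flag every metric whose value crosses its threshold.
        for (const auto& m : classMetrics) {
            if (m.second > threshold[m.first])
                std::cout << m.first << " = " << m.second
                          << " exceeds the threshold of " << threshold[m.first] << std::endl;
        }
        return 0;
    }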
  1. Using software metrics, the researcher can identify change/fault-prone classes, which
     a. enables software developers to take focused preventive actions that can reduce
         maintenance costs and improve quality.
     b. helps software managers to allocate resources more effectively. For example, if
         we have 26% of the testing resources, then we can use these resources to test the top
         26% of classes predicted to be faulty/change prone.
  2. Among a large set of software metrics (independent variables), we can find a suit-
     able subset of metrics using various techniques such as correlation-based feature
     selection, univariate analysis, and so on. These techniques help in reducing the
     number of independent variables (termed as “data dimensionality reduction”).
     Only the metrics that are significant in predicting the dependent variable are con-
     sidered. Once the metrics found to be significant in detecting faulty/change-prone
     classes are identified, software developers can use them in the early phases of
     software development to measure the quality of the system.
  3. Another important application is that, once the metrics captured by the models
     are known, such metrics can be used as quality benchmarks to assess and compare
     products.
  4. Metrics also provide an insight into the software, as well as the processes used to
     develop and maintain it.
  5. There are various metrics that calculate the complexity of a program. For exam-
     ple, the McCabe metric helps in assessing code complexity; the Halstead metrics
     help in calculating different measurable properties of software (programming
     effort, program vocabulary, program length, etc.); and fan-in and fan-out metrics
     estimate maintenance complexity. Once the complexity is known, more complex
     programs can be given focused attention.
  6. As explained in Section 3.9.3, we can calculate the threshold values of different
     software metrics. By using threshold values of the metrics, we can identify and
     focus on the classes that fall outside the acceptable risk level. Hence, during
     project development and progress, we can scrutinize the classes and prepare
     alternative design structures wherever necessary.
   7. Evolutionary algorithms such as genetic algorithms help in solving the optimiza-
      tion problems and require the fitness function to be defined. Software metrics help
      in defining the fitness function (Harman and Clark 2004) in these algorithms.
  8. Last but not least, new software metrics that help improve the quality of the
     system in some way can be defined in addition to the metrics proposed in the
     literature.
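As a rough illustration of the univariate screening mentioned in point 2 above, the following Python sketch fits a univariate logistic regression for each metric and retains those whose coefficient is significant at the 0.05 level. The data and the 0.05 cut-off are hypothetical and only one possible choice; this is a sketch, not a prescribed procedure.

import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one row per class, OO metrics plus a binary fault label
data = pd.DataFrame({
    "WMC": [5, 20, 7, 33, 12, 25],
    "CBO": [2, 9, 4, 15, 6, 11],
    "faulty": [0, 1, 0, 1, 1, 0],
})

significant = []
for metric in ["WMC", "CBO"]:
    # Univariate logistic regression of the fault label on a single metric
    model = sm.Logit(data["faulty"], sm.add_constant(data[metric])).fit(disp=0)
    if model.pvalues[metric] < 0.05:  # keep the metric if its coefficient is significant
        significant.append(metric)
print("Metrics retained:", significant)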
Exercises
  3.1 What are software metrics? Discuss the various applications of metrics.
  3.2 Discuss categories of software metrics with the help of examples of each category.
  3.3 What are categorical metric scales? Differentiate between nominal scale and ordinal
     scale in the measurements and also discuss both the concepts with examples.
  3.4 What is the role and significance of Weyuker’s properties in software metrics?
  3.5 Define the role of fan-in and fan-out in information flow metrics.
  3.6 What are various software quality metrics? Discuss them with examples.
  3.7 Define usability. What are the various usability metrics? What is the role of cus-
     tomer satisfaction?
  3.8 Define the following metrics:
     a. Statement coverage metric
     b. Defect density
     c. FCMs
  3.9 Define coupling. Explain Chidamber and Kemerer metrics with examples.
  3.10 Define cohesion. Explain some cohesion metrics with examples.
  3.11 How do we measure inheritance? Explain inheritance metrics with examples.
  3.12 Define the following metrics:
     a. CLD
     b. AID
     c. NOC
     d. DIT
     e. NOD
      f. NOA
     g. NOP
     h. SIX
  3.13 What is the purpose and significance of computing the threshold of software
     metrics?
  3.14 How can metrics be used to improve software quality?
  3.15 Consider that the threshold value of the CBO metric is 8 and that of the WMC metric
     is 15. What do these values signify? What corrective actions do you think a developer
     must take if the values of CBO and WMC exceed these thresholds?
  3.16 What are the practical applications of software metrics? How can the metrics be
     helpful to software organizations?
  3.17 What are the five measurement scales? Explain their properties with the help of
     examples.
  3.18 How are the external and internal attributes related to process and product metrics?
  3.19 What is the difference between process and product metrics?
  3.20 What is the relevance of software metrics in research?
Further Readings
An in-depth study of eighteen different categories of software complexity metrics was
provided by Zuse, who attempted to give a basic definition for the metrics in each category:
Fenton’s book on software metrics is a classic and useful reference as it provides in-depth
discussions on measurement and key concepts related to metrics:
  N. Fenton, and S. Pfleeger, Software Metrics: A Rigorous & Practical Approach, PWS
    Publishing Company, Boston, MA, 1997.
The traditional Software Science metrics proposed by Halstead are listed in:
Chidamber and Kemerer (1991) proposed the first significant OO design metrics. Another
paper by Chidamber and Kemerer defined and validated the OO metrics suite in 1994.
This metrics suite is widely used and has received the widest attention in empirical
studies:
The following paper explains various OO metric suites with real-life examples:
www.acis.pamplin.vt.edu/faculty/tegarden/wrk-pap/ooMETBIB.PDF
After the problem is defined, the experimental design process begins. The study must be
carefully planned and designed to draw useful conclusions from it. The formation of a
research question (RQ), selection of variables, hypothesis formation, data collection, and
selection of data analysis techniques are important steps that must be carefully carried
out to produce meaningful and generalized conclusions. This would also facilitate the
opportunities for repeated and replicated studies.
  The empirical study involves creation of a hypothesis that is tested using statistical
techniques based on the data collected. The model may be developed using multivariate
statistical techniques or machine learning techniques. The steps involved in the experi-
mental design are presented to ensure that proper steps are followed for conducting an
empirical study. In the absence of a planned analysis, a researcher may not be able to draw
well-formed and valid conclusions. All the activities involved in empirical design are
explained in detail in this chapter.
FIGURE 4.1
Steps in experimental design: identify goals, hypothesis formulation, variable selection, empirical data collection, and creating a solution to the problem.
of fault and has been published in Singh et al. (2010). Hereafter, the study will be referred
to as fault prediction system (FPS). The objective, motivation, and context of the study are
described below.
4.2.2 Motivation
The study predicts an important quality attribute, fault proneness during the early phases
of software development. Software metrics are used for predicting fault proneness. An
important contribution of this study is that it takes into account the severity of faults dur-
ing fault prediction. The value of severity quantifies the impact of the fault on the soft-
ware operation. The IEEE standard (1044–1993, IEEE 1994) states, “Identifying the severity
of an anomaly is a mandatory category as is identifying the project schedule, and project
cost impacts of any possible solution for the anomaly.” Not all failures are of the same
type; they may vary in the impact that they may cause. For example, a failure caused by a
fault may lead to a whole system crash or an inability to open a file (El Emam et al. 1999;
Aggarwal et al. 2009). In this example, it can be seen that the former failure is more severe
than the latter. Lack of determination of severity of faults is one of the main criticisms of
the approaches to fault prediction in the study by Fenton and Neil (1999). Therefore, there
is a need to develop prediction models that can be used to identify classes that are prone to
have serious faults. Software practitioners can use the model built with respect to
high-severity faults to focus the testing on those parts of the system that are likely to cause
serious failures. In this study, the faults are categorized with respect to all the severity levels
given in the NASA data set to improve the effectiveness of the categorization and provide
meaningful, correct, and detailed analysis of fault data. Categorizing the faults according to
different severity levels helps prioritize the fixing of faults (Afzal 2007). Thus, the software
practitioners can deal with the faults that are at higher priority first, before dealing with the
faults that are comparatively of lower priority. This would allow the resources to be judi-
ciously allocated based on the different severity levels of faults. In this work, the faults are
categorized into three levels: high severity, medium severity, and low severity.
   Several regression (such as linear and logistic regression [LR]) and machine learning
techniques (such as decision tree [DT] and artificial neural network [ANN]) have been pro-
posed in the literature. Few studies use machine learning techniques for fault prediction
based on OO metrics; most of the prediction models in the literature are built using
statistical techniques. Because the many available machine learning techniques can give
different results, there is a need to compare them. ANN and DT methods have seen an
explosion of interest over the years and
are being successfully applied across a wide range of problem domains such as finance,
medicine, engineering, geology, and physics. Indeed, these methods are being introduced
to solve the problems of prediction, classification, or control (Porter 1990; Eftekhar 2005;
Duman 2006; Marini 2008). It is natural for software practitioners and potential users to
wonder, “Which classification technique is best?,” or more realistically, “What methods
tend to work well for a given type of data set?” More data-based empirical studies, which
are capable of being verified by observation or experiment, are needed. Today, the evi-
dence gathered through these empirical studies is considered to be the most powerful
support possible for testing a given hypothesis (Aggarwal et  al. 2009). Hence, conduct-
ing empirical studies to compare regression and machine learning techniques is necessary
to build an adequate body of knowledge to draw strong conclusions leading to widely
accepted and well-formed theories.
4.2.4 Results
The results show that the area under the curve (measured from the ROC analysis) of the
models predicted using high-severity faults is lower than that of the models predicted
with respect to medium- and low-severity faults.
Hence, the RQs must fill the gap between existing literature and current work and must
give some new perspective to the problem. Figure 4.2 depicts the context of the RQs. The
RQ may be formed according to the research types given below:
FIGURE 4.2
Context of research questions.
4.3.2 Characteristics of an RQ
The following are the characteristics of a good RQ:
  1. Clear: The reader who may not be an expert in the given topic should understand
     the RQs. The questions should be clearly defined.
  2. Unambiguous: The use of vague statements that can be interpreted in multiple ways
     should be avoided while framing RQs. For example, consider the following RQ:
     Are OO metrics significant in predicting various quality attributes?
     The above statement is very vague and can lead to multiple interpretations. This is
     because a number of quality attributes are present in the literature. It is not clear
     which quality attribute one wants to consider. Thus, the above vague statement
     can be redefined in the following way. In addition, the OO metrics can also be
     specified.
     Are OO metrics significant in predicting fault proneness?
  3. Empirical focus: This property requires generating data to answer the RQs.
  4. Important: This characteristic requires that answering an RQ makes a significant
     contribution to the research and that there will be beneficiaries.
Finally, the research problem must be stated in either a declarative or interrogative form.
The examples of both the forms are given below:
  Declarative form: The present study focuses on predicting change-prone parts of the
     software at the early stages of the software development life cycle. Early prediction
     of change-prone classes will save considerable resources in terms of money, man-
     power, and time. For this, the well-known Chidamber and Kemerer metrics suite is
     considered and the relationship between the metrics and change proneness is determined.
  Interrogative form: What are the consequences of predicting the change-prone parts
     of the software at the early stages of software development life cycle? What is the
     relationship between Chidamber and Kemerer metrics and change proneness?
  • RQ1: Which OO metrics are related to fault proneness of classes with regard to
    high-severity faults?
  • RQ2: Which OO metrics are related to fault proneness of classes with regard to
    medium-severity faults?
  • RQ3: Which OO metrics are related to fault proneness of classes with regard to
    low-severity faults?
  • RQ4: Is the performance of machine learning techniques better than the LR method?
The main aim of the research is to contribute toward a better understanding of the con-
cerned field. A literature review analyzes a body of literature related to a research topic
to have a clear understanding of the topic, what has already been done on the topic, and
what are the key issues that need to be addressed. It provides a complete overview of the
existing work in the field. Figure 4.3 depicts various questions that can be answered while
conducting a literature review.
  The literature review involves collection of research publications (articles, conference
paper, technical reports, book chapters, journal papers) on a particular topic. The aim
is to gather ideas, views, information, and evidence on the topic under investigation.
FIGURE 4.3
Key questions while conducting a review, for example: What are the key theories, concepts, and ideas? What are the key areas where knowledge gaps exist?
   1. Increase in familiarity with the previous relevant research and prevention from
      duplication of the work that has already been done.
   2. Critical evaluation of the work.
   3. Facilitation of development of new ideas and thoughts.
   4. Highlighting key findings, proposed methodologies, and research techniques.
   5. Identification of inconsistencies, gaps, and contradictions in the literature.
   6. Extraction of areas where attention is required.
     c. ScienceDirect/Elsevier
     d. Wiley
     e. ACM
      f. Google Scholar
     Before searching in digital portals, the researchers need to identify the most
     credible research journals in the related areas. For example, in the area of soft-
     ware engineering, some of the important journals in which search can be done
     are: Software: Practice and Experience, Software Quality Journal, IEEE Transactions on
     Software Engineering, Information and Software Technology, Journal of Computer Science
     and Technology, ACM Transactions on Software Engineering Methodology, Empirical
     Software Engineering, IEEE Software Maintenance, Journal of Systems and Software, and
     Software Maintenance and Evolution.
     Besides searching the journals and portals, various educational books, scientific
     monograms, government documents and publications, dissertations, gray litera-
     ture, and so on that are relevant to the concerned topic or area of research should
     be explored. Most importantly, the bibliographies and reference lists of the materi-
     als that are read need to be searched. These give pointers to more articles
     and also provide a good estimate of how much has been read on the selected
     topic of research.
     After the digital portals and Internet resources have been identified, the next step
     is to form the search string. The search string is formed by using the key terms
     from the selected topic in the research. The search string is used to search the
     literature from the digital portal.
  2. Conduct the search: This step involves searching the identified sources by using
     the formed search string. The abstracts and/or full texts of the research papers
     should be obtained for reading and analysis.
  3. Analyze the literature: Once the research papers relevant to the research topic
     have been obtained, the abstract should be read, followed by the introduction
     and conclusion sections. The relevant sections can be identified and read by the
     section headings. In case of books, the index must be scanned to obtain an idea
     about the relevant topics. The materials that are highly relevant in terms of mak-
     ing the greatest contribution in the related research or the material that seems the
     most convincing can be separated. Finally, a decision about reading the necessary
     content must be made.
     The strengths, drawbacks, and omissions in the literature must be iden-
     tified on the basis of the evidence present in the papers. After thoroughly and
     critically analyzing the literature, the differences of the proposed work from the
     literature must be highlighted.
  4. Use the results: The results obtained from the literature review must then be
     summarized for later comparison with the results obtained from the current
     work.
of the subject under concern. It discusses the kind of work that is done on the concerned
topic of research, along with any controversies that may have been encountered by
different authors. The “body” contains and focuses on the main idea behind each paper in
the review. The relevance of the papers cited should be clearly stated in this section of the
review. The aim is not simply to restate what the other authors have said; instead, the
main aim should be to critically evaluate each paper. Then, a conclusion should be
provided that summarizes what the literature says. The conclusion summarizes all the
evidence presented and shows its significance. If the review is an introduction to our own
research, it indicates how the previous research has led to our own, focusing on and
highlighting the gaps in the previous research (Bell 2005). The following points must be
covered while writing a literature review:
  • Identify the topics that are similar in multiple papers to compare and contrast
    different authors’ views.
  • Group authors who draw similar conclusions.
  • Group authors who are in disagreement with each other on certain topics.
  • Compare and contrast the methodologies proposed by different authors.
  • Show how the study is related to the previous studies in terms of the similarities
    and the differences.
  • Highlight exemplary studies and gaps in the research.
The above-mentioned points will help to carry out effective and meaningful literature
review.
Basili et al.     C++           University environment,   C&K metrics, 3 code metrics     LR             LR                      Contingency table,
 (1996)                          180 classes                                                                                      correctness,
                                                                                                                                  completeness
Abreu and Melo    C++           University environment,   MOOD metrics                    Pearson        Linear least square     R2
 (1996)                          UMD: 8 systems                                            correlation
Binkley and       C++, Java     4 case studies, CCS:      CDM, DIT, NOC, NOD, NCIM,       Spearman       –                       –
 Schach (1998)                   113 classes, 82K SLOC,    NSSR, CBO                       rank
                                 29 classes, 6K SLOC                                       correlation
Harrison et al.   C++           University environment,   DIT, NOC, NMI, NMO              Spearman       –                       –
 (1999)                          SEG1: 16 classes SEG2:                                    rho
                                 22 classes SEG3:
                                 27 classes
Benlarbi and      C++           LALO: 85 classes,         OVO, SPA, SPD, DPA, DPD,        LR             LR                      –
 Melo (1999)                     40K SLOC                  CHNL, C&K metrics, part of
                                                           coupling metrics
El Emam et al.    Java          V0.5: 69 classes, V0.6:   Coupling metrics, C&K metrics   LR             LR                      R2, leave one-out
 (2001a)                         42 classes                                                                                       cross-validation
El Emam et al.    C++           Telecommunication         Coupling metrics, DIT           LR             LR                      R2
 (2001b)                         framework: 174 classes
Tang et al.       C++           System A: 20 classes      C&K metrics (without LCOM)      LR             –                       –
 (1999)                         System B: 45 classes
                                System C: 27 classes
Briand et al.     C++           University environment,   Suite of coupling metrics,      LR             LR                      R2, 10 cross-validation,
 (2000)                          UMD: 180 classes          49 metrics                                                             correctness,
                                                                                                                                  completeness
Glasberg et al.    Java          145 classes                  NOC, DIT ACAIC, OCAIC,            LR             LR                      R2, leave one-out
 (2000)                                                        DCAEC, OCAEC                                                              cross-validation, ROC
                                                                                                                                         curve, cost-saving
                                                                                                                                         model
El Emam et al.     C++, Java     Telecommunication            C&K metrics, NOM, NOA             LR             –                       –
 (2000a)                          framework: 174 classes,
                                  83 classes, 69 classes of
                                  Java system
Briand et al.      C++           Commercial system,           Suite of coupling metrics, OVO,   LR             LR                      R2, 10 cross-validation,
 (2001)                           LALO: 90 classes,            SPA, SPD, DPA, DPD, NIP, SP,                                             correctness,
                                  40K SLOC                     DP, 49 metrics                                                           completeness
Cartwright and     C++           32 classes, 133K SLOC        ATTRIB, STATES, EVENT,            Linear         Linear regression       –
 Shepperd (2000)                                               READS, WRITES, DIT, NOC           regression
Briand and         Java          Commercial system,           Polymorphism metrics, C&K         LR             LR, Mars                10 cross-validation,
 Wüst (2002)                      XPOSE & JWRITER:                                                                                       correctness,
                                  144 classes                                                                                            completeness
Yu et al. (2002)   Java          123 classes, 34K SLOC        C&K metrics, Fan-in, WMC          OLS+LDA        –                       –
Subramanyam        C++, Java                                  C&K metrics                       OLS            OLS
 and Krishnan
 (2003)
Gyimothy et al.    C++           Mozilla v1.6:                C&K metrics, LCOMN, LOC           LR, linear     LR, linear              10 cross-validation,
 (2005)                           3,192 classes                                                  regression,    regression, NN,         correctness,
                                                                                                 NN, DT         DT                      completeness
Aggarwal et al.    Java          University environment,      Suite of coupling metrics         LR             LR                      10 cross-validation,
 (2006a, 2006b)                   136 classes                                                                                           sensitivity, specificity
Arisholm and       Java          XRadar and JHawk             C&K metrics                       LR             LR
Yuming and        C++           NASA data set,               C&K metrics                            LR, ML         LR, ML                  Correctness,
Di Martino et al.   Java          Versions 4.0, 4.2, and       C&K, NPM, LOC                        –                Combination of          Precision, accuracy,
 (2011)                            4.3 of the jEdit system                                                            GA+SVM, LR,             recall, F-measure
                                                                                                                      C4.5, NB, MLP,
                                                                                                                      KNN, and RF
Azar and            Java          8 open source software       22 metrics by Henderson-Sellers      –                ACO, C4.5, random       Accuracy
 Vybihad (2011)                    systems                      (2007), Barnes and Swim (1993),                       guessing
                                                                Coppick and Cheatham (1992),
                                                                C&K
Malhotra and        –             Open source data set         C&K and QMOOD metrics                LR               LR and ML (ANN,         Sensitivity, specificity,
 Singh (2011)                      Arc, 234 classes                                                                   RF, LB, AB, NB,         precision
                                                                                                                      KStar, Bagging)
Malhotra and        Java          Apache POI, 422 classes      MOOD, QMOOD, C&K                     LR               LR, MLP (RF,            Sensitivity, specificity,
 Jain (2012)                                                    (19 metrics)                                          Bagging, MLP,           precision
                                                                                                                      SVM, genetic
                                                                                                                      algorithm)
Source: Compiled from multiple sources.
– implies that the feature was not examined.
LR: logistic regression, LDA: linear discriminant analysis, ML: machine learning, OLS: ordinary least squares linear regression, PC: principal component analysis, NN: neural network, BPN: back propagation neural network, PPN: probabilistic neural network, DT: decision tree, MLP: multilayer perceptron, SVM: support vector machine, RF: random forest, GA+SVM: combination of genetic algorithm and support vector machine, NB: naïve Bayes, KNN: k-nearest neighbor, C4.5: decision tree, ACO: ant colony optimization, Adtree: alternating decision tree, AB: adaboost, LB: logitboost, CHNL: class hierarchy nesting level, NCIM: number of classes inheriting a method, NSSR: number of subsystems-system relationships, NPM: number of public methods, LCOMN: lack of cohesion in methods allowing negative values.
Related to metrics: C&K: Chidamber and Kemerer, MOOD: metrics for OO design, QMOOD: quality metrics for OO design.
FIGURE 4.4
Relationship between dependent and independent variables: independent variables 1 through N feed into a process whose outcome is the dependent variable.
TABLE 4.2
Differences between Dependent and Independent Variables
Independent Variable                                                       Dependent Variable
Variable that is varied, changed, or manipulated.           It is not manipulated. The response or outcome that is
                                                             measured when the independent variable is varied.
It is the presumed cause.                                   It is the presumed effect.
Independent variable is the antecedent.                     Dependent variable is the consequent.
Independent variable refers to the status of the            Dependent variable refers to the status of the
  “cause,” which leads to the changes in the status of       “outcome” in which the researcher is interested.
  the dependent variable.
Also known as explanatory or predictor variable.            Also known as response, outcome, or target variable.
For example, various metrics that can be used to            For example, whether a module is faulty or not.
 measure various software constructs.
For example, if the researcher wants to find whether a UML tool is better than a traditional
tool and the effectiveness of the tool is measured in terms of productivity of the persons
using the tool, then hypothesis testing can be used directly using the data given in Table 4.3.
  Consider another instance where the researcher wants to compare two machine learn-
ing techniques to find the effect of software metrics on probability of occurrence of faults.
In this problem, first the model is predicted using two machine learning techniques. In the
next step, the model is validated and performance is measured in terms of performance
evaluation metrics (refer Chapter 7). Finally, hypothesis testing is applied on the results
obtained in the previous step for verifying whether the performance of one technique is
better than the other technique.
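For instance, a minimal Python sketch of such a direct test on the productivity data of Table 4.3 might look as follows. An independent two-sample t-test is assumed here purely for illustration; with only three observations per tool the result carries little weight, and other tests could equally be chosen.

from scipy import stats

uml_tool = [14, 67, 13]          # productivity values from Table 4.3
traditional_tool = [52, 61, 14]

t_stat, p_value = stats.ttest_ind(uml_tool, traditional_tool)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")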
   Figure 4.5 shows that the terms independent and dependent variable are used in both
experimental studies and multivariate analysis. In multivariate analysis, the independent
and dependent variables are used in model prediction: the independent variables are used
as predictor variables to predict the dependent variable. In experimental studies, factors
for a statistical test are also termed as independent variables that may have one or more
                                    TABLE 4.3
                                    Productivity for Tools
                                    UML Tool          Traditional Tool
                                    14                       52
                                    67                       61
                                    13                       14
FIGURE 4.5
Terminology used in experimental studies and multivariate analysis studies: in multivariate analysis, the independent variables are software metrics such as fan-in and cyclomatic complexity; in experimental studies, the independent variables or factors are techniques and methods such as machine learning techniques, and the dependent variable is, for example, accuracy.
levels called treatments or samples as suitable for a specific statistical test. For example, a
researcher may wish to test whether the mean of two samples is equal or not such as in the
case when a researcher wants to explore different software attributes like coupling before
and after a specific treatment like refactoring. Another scenario could be when a researcher
wants to explore the performance of two or more learning algorithms or whether two treat-
ments give uniform results. Thus, the dependent variable in experimental study refers to
the behavior measures of a treatment. In software engineering research, in some cases,
these may be the performance measures. Similarly, one may refer to performances on dif-
ferent data sets as data instances or subjects, which are exposed to these treatments.
  In software engineering research, the performance measures on data instances are
termed as the outcome or the dependent variable in case of hypothesis testing in experi-
mental studies. For example, technique A when applied on a data set may give an accuracy
(performance measure, defined as percentage of correct predictions) value of 80%. Here,
technique A is the treatment and the accuracy value of 80% is the outcome or the dependent
variable. However, in multivariate analysis or model prediction, the independent variables
are software metrics and the dependent variable may be, for example, a quality attribute.
  To avoid confusion, in this book we use the terminology related to multivariate analysis
unless specifically mentioned otherwise.
  Case 1: One factor, one treatment—In this case, there is one technique under obser-
    vation. For example, if the distribution of the data needs to be checked for a given
    variable, then this design type can be used. Consider a scenario where 25 students
    have developed the same program. The cyclomatic complexity values of the pro-
    grams can be evaluated using the chi-square test.
  Case 2: One factor, two treatments—This type of design may be purely randomized
    or paired design. For example, a researcher wants to compare the performance
    of two verification techniques such as walkthroughs and inspections. Another
    instance is when a researcher wants to compare the performance of two machine
    learning techniques, naïve Bayes and DT, on a given data set or over multiple data sets. In
    these two examples, there is one factor (verification method or machine learning tech-
    nique) but two treatments. The paired t-test or the Wilcoxon test can be used in these
    cases. Chapter 6 provides examples for these tests.
                             TABLE 4.4
                             Factors and Levels of Example
                             Factor                 Level 1       Level 2
                             Paradigm type          Structured    Object oriented
                             Complexity             Difficult     Simple
  Case 3: One factor, more than two treatments—In this case, the technique that is to
    be analyzed contains multiple values. For example, a researcher wants to compare
    multiple search-based techniques such as genetic algorithm, particle swarm opti-
    mization, and genetic programming. The Friedman test can be used in this case
    (a brief sketch follows this list); Section 6.4.13 provides a solution for this example.
  Case 4: Multiple factors and multiple treatments—In this case, more than one factor
    is considered with multiple treatments. For instance, consider an example where
    a researcher wants to compare paradigm types such as structured paradigm with
    OO paradigm. In conjunction to the paradigm type, the researcher also wants to
    check the complexity of the software being difficult or simple. This example is
    shown in Table 4.4 along with the factors and levels. ANOVA test can be used to
    solve such examples.
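A minimal Python sketch of Case 3 is given below; the accuracy values of the three search-based techniques over five data sets are hypothetical and serve only to show how the Friedman test is applied.

from scipy import stats

# Hypothetical accuracies of three techniques over the same five data sets
genetic_algorithm = [0.81, 0.78, 0.85, 0.80, 0.77]
particle_swarm    = [0.79, 0.80, 0.83, 0.82, 0.75]
genetic_program   = [0.74, 0.73, 0.79, 0.76, 0.70]

chi2, p_value = stats.friedmanchisquare(genetic_algorithm, particle_swarm, genetic_program)
print(f"Friedman chi-square = {chi2:.2f}, p = {p_value:.3f}")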
The examples of the above experimental design types are given in Section 6.4. After deter-
mining the appropriate experiment design type, the hypothesis needs to be formed in an
empirical study.
  • It provides the researcher with a relational statement that can be directly tested in
    a research study.
FIGURE 4.6
Generation of a hypothesis in research: a primary thought (not fully formed) leads to research questions, which lead to a thought-through and well-formed idea, the research hypothesis.
        TABLE 4.5
        Transition from RQ to Hypothesis
        RQ                                                    Corresponding Hypothesis
        Is X related to Y?                                If X, then Y.
        How are X and Y related to Z?                     If X and Y, then Z.
        How is X related to Y and Z?                      If X, then Y and Z.
        How is X related to Y under conditions Z and W?   If X, then Y under conditions Z and W.
  4. Write down the hypotheses in a format that is testable through scientific research:
     There are two types of hypothesis—null and alternative hypotheses. Correct for-
     mation of null and alternative hypotheses is the most important step in hypoth-
     esis testing. The null hypothesis is also known as the hypothesis of no difference and
     is denoted as H0. The null hypothesis is the proposition that implies that there is no
     statistically significant relationship within a given set of parameters. It denotes the
     reverse of what the researcher would actually expect or predict in the experiment.
     The alternative hypothesis is denoted as Ha. It reflects that a statistically significant
     relationship does exist within a given set of parameters. It is the opposite of the null
     hypothesis and is only reached if H0 is rejected. The detailed explanation of null and
     alternative hypotheses is given in Section 4.7.5. Table 4.5 presents the hypotheses
     corresponding to the given RQs.
Some of the examples to show the transition from an RQ to a hypothesis are stated below:
  RQ: What is the relation of coupling between classes and maintenance effort?
  Hypothesis: Coupling between classes and maintenance effort are positively related
    to each other.
     Example 4.1:
     There are various factors that may have an impact on the amount of effort required to
     maintain software. The programming language in which the software is developed
     can be one of the factors affecting the maintenance effort. There are various program-
     ming languages available, such as Java, C++, C#, C, and Python. There is a need to
     identify whether these languages have a positive, negative, or neutral effect
     on the maintenance effort. It is believed that programming languages have a positive
     impact on the maintenance effort. However, this needs to be tested and confirmed
     scientifically.
     Solution:
     The problem and the hypothesis derived from it are given below:
FIGURE 4.7
Steps in hypothesis testing: define the hypothesis (null and alternative hypotheses) and derive conclusions (check the statistical significance of the results).
  The null hypothesis can be written in mathematical form, depending on the descriptive
statistic on which the hypothesis is based. For example, if the descriptive statistic used is
the population mean, then the general form of the null hypothesis is,
Ho : µ = X
where:
 µ is the mean
 X is the predefined value
In this example, whether the population mean equals X or not is being tested.
  There are two possible scenarios through which the value of X can be derived. These
depend on two different types of RQs. In other words, the population parameter (mean in
the above example) can be assigned a value in two different ways. The first is that the
predetermined value is selected for practical or proven reasons. For example, a software
company decides that 7 is its predetermined quality parameter for mean coupling. Hence,
all the departments will be informed that the modules must have a coupling value of <7
to ensure less complexity and high maintainability. Similarly, the company may decide
that it will devote all the testing resources to those faults that have a mean rating above 3.
The testers will therefore want to specifically test all those faults that have a mean rating >3.
  Another situation is where a population under investigation is compared with another
population whose parameter value is known. For example, from past data it is known
that the average productivity of employees is 30 for project A. We want to see whether the
average productivity of employees for project B is also 30. Thus, we want to make an
inference about whether the unknown average productivity for project B is equal to the
known average productivity for project A.
  The general form of alternative hypothesis when the descriptive parameter is taken as
mean (µ) is,
                                           Ha : µ ≠ X
where:
 µ is the mean
 X is the predefined value
The above hypothesis represents a nondirectional hypothesis as it just denotes that there
will be a difference between the two groups, without discussing how the two groups differ.
The example is stated in terms of two popularly used methods to measure the size of soft-
ware, that is, (1) LOC and (2) function point analysis (FPA). The nondirectional hypothesis
can be stated as, “The size of software as measured by the two techniques is different.”
In contrast, when the hypothesis is used to show the relationship between the two groups
rather than simply comparing them, the hypothesis is known as a directional hypothesis.
Comparison terms such as “greater than,” “less than,” and so on are used in the
formulation of the hypothesis. In other words, it specifies how the two groups differ. For
example, “The size of software as measured by FPA is more accurate than LOC.” Thus, the
direction of the difference is mentioned. The same concept is represented by one-tailed and
two-tailed tests in statistical testing and is explained in Section 6.4.3.
  One important point to note is that the potential outcome that a researcher is expecting
from his/her experiment is denoted in terms of alternative hypothesis. What is believed
to be the theoretical expectation or concept is written in terms of alternative hypothesis.
Thus, sometimes the alternative hypothesis is referred to as the research hypothesis. Now,
if the alternative hypothesis represents the theoretical expectation or concept, then what
is the reason for performing hypothesis testing? This is done to check whether the
formed or assumed concepts are actually significant or true. Thus, the main aim is to check
the validity of the alternative hypothesis. If the null hypothesis is accepted, it signifies that
the idea or concept of the research is false.
There are various tests available in research for verifying a hypothesis; they are given as
follows:
FIGURE 4.8
Critical region (region of rejection).
FIGURE 4.9
Significance levels.
                            TABLE 4.6
                            A Sample Data Set
                                        CBO for Faulty         CBO for Nonfaulty
                            S. No.        Modules                  Modules
                            1                   45                        9
                            2                   56                        9
                            3                   34                        9
                            4                   71                        7
                            5                   23                       10
                            6                    9                       15
                            Mean               39.6                     9.83
              Ha : µ(CBOfaulty) > µ(CBOnonfaulty) or µ(CBOfaulty) < µ(CBOnonfaulty)
    Step 3: Determining the appropriate test to apply
    As the problem is of comparing means of two dependent samples (collected from
       same software), the paired t-test is used. In Chapter 6, the conditions for selecting
       appropriate tests are given.
    Step 4: Calculating the value of test statistic
    Table 4.7 shows the intermediary calculations of t-test.
    The t-statistics is given as:

                                      t = (µ1 − µ2)/(σd/√n)

    where:
      µ1 is the mean of first population
      µ2 is the mean of second population

                              σd = √[(Σd² − (Σd)²/n)/(n − 1)]
                      TABLE 4.7
                      T-Test Calculations
                      CBO for Faulty        CBO for Nonfaulty    Difference
                      Modules                   Modules              (d)        d²
                      45                              9              36       1,296
                      56                              9              47       2,209
                      34                              9              25         625
                      71                              7              64       4,096
                      23                             10              13         169
                      9                              15              –6          36
      where:
          n represents number of pairs and not total number of samples
          d is the difference between values of two samples
    Substituting the values of mean, variance, and sample size in the above formula, the
      t-score is obtained as:
                     σd = √[(Σd² − (Σd)²/n)/(n − 1)] = √[(8431 − (179)²/6)/5] = 24.86

                           t = (µ1 − µ2)/(σd/√n) = (39.66 − 9.83)/(24.86/√6) = 2.93
    As the alternative hypothesis is of the form Ha: µ1 > µ2 or µ1 < µ2, the test is non-
       directional (two-tailed). Let us take the level of significance (α) as 0.05.
    Step 5: Determine the significance value
    The p-value for a two-tailed test at a significance level of 0.05 with df = 5 is considered.
       From the t-distribution table, it is observed that the p-value is 0.032 (refer to Section
       6.4.6 for computation of p-value).
    Step 6: Deriving conclusions
    Now, to decide whether to accept or reject the null hypothesis, this p-value is compared
       with the level of significance. As the p-value (0.032) is less than the level of significance
       (0.05), H0 is rejected. In other words, the alternative hypothesis is accepted. Thus,
       it is concluded that there is a statistically significant difference between the average
       coupling (CBO) of faulty modules and that of nonfaulty modules.
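The computation above can be checked with a few lines of Python (a sketch, not part of the original example); SciPy's paired t-test reproduces the hand-calculated values up to rounding.

from scipy import stats

cbo_faulty = [45, 56, 34, 71, 23, 9]      # Table 4.6, CBO for faulty modules
cbo_nonfaulty = [9, 9, 9, 7, 10, 15]      # Table 4.6, CBO for nonfaulty modules

t_stat, p_value = stats.ttest_rel(cbo_faulty, cbo_nonfaulty)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")  # approximately t = 2.94, p = 0.032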
are tested to compare the performance of regression and machine learning techniques at
different severity levels of faults:
  First degree: The researcher is in direct contact or involvement with the subjects
     under concern. The researcher or software engineer may collect data in real-time.
     For example, under this category, the various methods are brainstorming, inter-
     views, questionnaires, think-aloud protocols, and so on. There are various other
     methods as depicted in Figure 4.10.
  Second degree: There is no direct contact of the researcher with the subjects during
     data collection. The researcher collects the raw data without any interaction with
     the subjects. For example, observations through video recording and fly on the wall
     (participants taping their work) are the two methods that come under this category.
  Third degree: There is access only to the work artifacts. In this, already avail-
     able and compiled data is used. For example, analysis of various documents
     produced from an organization such as the requirement specifications, fail-
     ure reports, document change logs, and so on come under this category. There
     are various reports that can be generated using different repositories such
     as change report, defect report, effort data, and so on. All these reports play
     an important role while conducting a research. But the accessibility of these
     reports from the industry or any private organization is not an easy task. This
     is discussed in the next subsection, and the detailed collection methods are
     presented in Chapter 5.
The main advantage of the first and second degree methods is that the researcher has
control over the data to a large extent. Hence, the researcher needs to formulate and decide
FIGURE 4.10
Various data-collection strategies. First degree (direct involvement of software engineers): inquisitive techniques (brainstorming and focus groups, interviews, questionnaires, conceptual modeling) and observational techniques (work diaries, think-aloud protocols, shadowing and observation, synchronized shadowing, participant observation, i.e., joining the team). Second degree (indirect involvement of software engineers): instrumenting systems, and fly on the wall (participants taping their work).
on data-collection methods in the experimental design phase. The methods under these
categories require effort from both the researcher and the subject. For this reason,
first degree methods are more expensive than second or third degree methods. Third
degree methods are the least expensive, but the control over the data is minimal. This compro-
mises the quality of the data, as the correctness of the data is not under the direct control
of the researcher.
   Under the first degree category, interviews and questionnaires are the easiest
and most straightforward methods. In interview-based data collection, the researcher pre-
pares a list of questions about the areas of interest. Then, an interview session takes
place between the researcher and the subject(s), wherein the researcher can ask vari-
ous research-related questions. Questions can be either open, inviting a broad range of
answers, or closed, offering a limited set of answers. The drawback of
collecting data from interviews and questionnaires is that they typically produce an
incomplete picture. For example, if one wants to know the number of LOC in a soft-
ware program, interviews and questionnaires will only provide general
opinions and evidence; accurate information is not provided. Methods such as
think-aloud protocols and work diaries can be used for this strategy of data collection.
Second degree requires access to the environment in which participants or subject(s)
work, but without having direct contact with the participants. Finally, the third degree
requires access only to work artifacts, such as source code or bugs database or docu-
mentation (Wohlin 2012).
TABLE 4.8
Differences between the Types of Data Sets
S. No.             Academic                          Industrial                         Open Source
1        Obtained from the projects       Obtained from the projects           Obtained from the projects
          made by the students of          developed by experienced and          developed by experienced
          some university                  qualified programmers                 developers located at different
                                                                                 geographical locations
2        Easy to obtain                   Difficult to obtain                  Easy to obtain
3        Obtained from data set that is   Obtained from data set               Obtained from data set
           not necessarily maintained      maintained over a long period         maintained over a long period
           over a long period of time      of time                               of time
4        Results are not reliable and     Results are highly reliable and      Results may be reliable and
           acceptable                      acceptable                            acceptable
5        It is freely available           May or may not be freely             It is generally freely available
                                           available
6        Uses ad hoc approach to          Uses very well planned               Uses well planned and mature
          develop projects                 approach                             approach
7        Code may be available            Code is not available                Code is easily available
8        Example: Any software            Example: Performance Manage-         Example: Android, Apache
          developed in university such     ment traffic recording (Lindvall     Tomcat, Eclipse, Firefox, and
          as LALO (Briand et al. 2001),    1998), commercial OO system          so on
          UMD (Briand et al. 2000),        implemented in C++ (Bieman
          USIT (Aggarwal et al. 2009)      et al. 2003), UIMS (Li  and Henry
                                           1993), QUES (Li and Henry 1993)
university systems, industrial or commercial systems, and public or open source soft-
ware. The academic data is the data that is developed by the students of some univer-
sity. Industrial data is the proprietary data belonging to some private organization or a
company. Public data sets are available freely to everyone for use and does not require any
payment from the user. The differences between them are stated in Table 4.8.
   It is relatively easy to obtain the academic data as it is free from confidentiality concerns
and, hence, gaining access to such data is easier. However, the accuracy and reliability of
the academic data is questionable while conducting research. This is because the university
software is developed by inexperienced, small number of programmers and is typically
not applicable in real-life scenarios. Besides the university data sets, there is public or open
source software that is widely used for conducting empirical research in the area of soft-
ware engineering. The use of open source software allows the researchers to access vast
repositories of reasonable quality, large-sized software. The most important type of data is
the proprietary/industrial data that is usually owned by a corporation/organization and
is not publically available.
   The usage of open source software has been on the rise, with products such as Android
and Firefox becoming household names. However, majority of the software devel-
oped across the world, especially the high-quality software, still remains proprietary
software. This is because of the fact that given the voluntary nature of developers for
open source software, the attention of the developers might shift elsewhere leading to
lack of understanding and poor quality of the end product. For the same reason, there
are also challenges with timeliness of the product development, rigor in testing and
documentation, as well as characteristic lack of usage support and updates. As opposed
to this, the proprietary software is typically developed by an organization with clearly
134                                                Empirical Research in Software Engineering
demarcated manpower for design, development, and testing of the software. This allows
for committed, structured development of software for a well-defined end use, based on
robust requirement gathering. Therefore, it is imperative that the empirical studies in
software engineering be validated over data from proprietary systems, because the devel-
opers of such proprietary software would be the key users of the research. Additionally,
industrial data is better suited for empirical research because the development follows
a structured methodology, and each step in the development is monitored and docu-
mented along with its performance measurement. This leads to development of code that
follows rigorous standards and robustly captures the data sets required by the academia
for conducting their empirical research.
  At the same time, access to the proprietary software code is not easily obtained. For most
of the software development organizations, the software constitutes their key intellectual
asset and they undertake multiple steps to guard the privacy of the code. The world’s most
valuable products, such as Microsoft Windows and Google search, are built around their
closely held patented software to guard against competition and safeguard their products
developed with an investment of billions of dollars. Even if there is appreciation of the role
and need of the academia to access the software, the enterprises typically hesitate to share
the data sets, leading to roadblocks in the progress of empirical research.
  It is crucial for the industry to appreciate that the needs of the empirical research do not
impinge on their considerations of software security. The data sets required by the academia
are the metrics data or the data from the development/testing process, and does not com-
promise on security of the source code, which is the primary concern of the industry. For
example, assume an organization uses commercial code management system/test manage-
ment system such as HP Quality Center or HP Application Lifecycle Management. Behind
the scenes, a database would be used to store information about all modules, including all
the code and its versions, all development activity in full detail, and the test cases and their
results. In such a scenario, the researcher does not need access to the data/code stored in the
database, which the organization would certainly be unwilling to share, but rather specific
reports corresponding to the problem he wishes to address. As an illustration, for a defect
prediction study, only a list of classes with corresponding metrics and defect count would
be required, which would not compromise the interests of the organization. Therefore, with
mutual dialogue and understanding, appropriate data sets could be shared by the industry,
which would create a win-win situation and lead to betterment of the process. The key chal-
lenge, which needs to be overcome, is to address the fear of the enterprises regarding the
type of data sets required and the potential hazards. A constructive dialogue to identify the
right reports would go a long way towards enabling the partnership because access to the
wider database with source code would certainly be impossible.
  Once the agreement with the industry has been reached and the right data sets have been
received, the attention can be shifted to actual conducting of the empirical research with
the more appropriate industrial data sets. The benefits of using the industrial database
would be apparent in the thoroughness of the data sets available and the consistency of
the software system. This would lead to more accurate findings for the empirical research.
data, which is implemented in the C++ programming language. Fault data for KC1  is
collected since the beginning of the project (storage management system) but that data
can only be associated back to five years (MDP 2006). This system consists of 145 classes
that comprise 2,107 methods, with 40K LOC. KC1 provides both class-level and method-
level static metrics. At the method level, 21 software product metrics based on product’s
complexity, size, and vocabulary are given. At the class level, values of ten  metrics are
computed, including six metrics given by Chidamber and Kemerer (1994). The seven OO
metrics are taken in this study for analyses. In KC1, six files provide association between
class/method and metric/defect data. In particular, there are four files of interest, the first
representing the association between classes and methods, the second representing asso-
ciation between methods and defects, the third representing association between defects
and severity of faults, and the fourth representing association between defects and specific
reason for closure of the error report.
  First, defects are associated with each class according to their severities. The value of
severity quantifies the impact of the defect on the overall environment with 1 being most
severe to 5  being least severe as decided in data set KC1. The defect data from KC1  is
collected from information contained in error reports. An error either could be from the
source code, COTS/OS, design, or is actually not a fault. The defects produced from the
source code, COTS/OS, and design are taken into account. The data is further processed
by removing all the faults that had “not a fault” keyword used as the reason for closure of
error report. This reduced the number of faults from 669 to 642. Out of 145 classes, 59 were
faulty classes, that is, classes with at least one fault and the rest were nonfaulty.
  In this study, the faults are categorized as high, medium, or low severity. Faults with
severity rating 1 were classified as high-severity faults. Faults with severity rating 2 were
classified as medium-severity faults and faults with severity rating 3, 4, and 5 as low-sever-
ity faults, as at severity rating 4 no class is found to be faulty and at severity rating 5 only
one class is faulty. Faults at severity rating 1 require immediate correction for the system
to continue to operate properly (Zhou and Leung 2006).
  Table 4.9 summarizes the distribution of faults and faulty classes at high-, medium-, and
low-severity levels in the KC1 NASA data set after preprocessing of faults in the data set.
High-severity faults were distributed in 23 classes (15.56%). There were 48 high-severity
faults (7.47%), 449 medium-severity faults (69.93%), and 145 low-severity faults (22.59%). As
shown in Table 4.9, majority of the classes are faulty at severity rating medium (58 out of
59 faulty classes). Figure 4.11a–c shows the distribution of high-severity faults, medium-
severity faults, and low-severity faults. It can be seen from Figure  4.11a that 22.92% of
classes with high-severity faults contain one fault, 29.17% of classes contain two faults, and
so on. In addition, the maximum number of faults (449 out of 642) is covered at medium
severity (see Figure 4.11b).
          TABLE 4.9
          Distribution of Faults and Faulty Classes at High-, Medium-, and Low-Severity
          Levels
          Level of      Number of       % of Faulty     Number of      % of Distribution
          Severity    Faulty Classes     Classes         Faults            of Faults
                                                                                      1–2 Faults
                 8 Faults                                          29–77 Faults         7.75%
                 16.67%                    1 Faults                  18.64%                        3–4 Faults
                                           22.92%                                                    7.26%
                                                                                                        5–6 Faults
    5 Faults
                                                                                                         11.38%
    10.42%
                                                       15–28 Faults
                                                          15.5%
      4 Faults                                                                                         7–9 Faults
       8.33%                                                                                            13.32%
                                            2 Faults
                                            29.17%
                 3 Faults                                                   10–14 Faults
                  12.5%                                                       26.15%
   (a)                                                 (b)
                                                             1 Faults
                                            9–12 Faults       5.52%
                                              15.17%
                                                                              2 Faults
                                                                              16.55%
                                                               4–7 Faults
                                                                17.93%
                            (c)
FIGURE 4.11
Distribution of (a) high-, (b) medium-, and (c) low-severity faults.
   1. Diversity in data: The variables or attributes of the data set may belong to different
      categories such as discrete, continuous, discrete ordered, counts, and so on. If the
      attributes are of many different kinds, then some of the algorithms are preferable
      over others as they are easy to apply. For example, among machine learning tech-
      niques, support vector machine, neural networks, and nearest neighbor methods
      require that the input attributes are numerical and scaled to similar ranges (e.g., to
      the [–1,1] interval). Among statistical techniques, linear regression and LR require
                                                                                  Logistic
                                                                                 regression
                                                            Statistical
                                                                               Discriminant
                                                                                 analysis
                                        Binary
                                                                               Decision tree
                                                            Machine
                  Type of                                   learning          Support vector
                 dependent                                                       machine
                  variable                                  Machine
                                                            learning          Artificial neural
                                                                                  network
                                      Continuous
                                                                                   Linear
                                                            Statistical          regression
                                                                               Ordinary least
                                                                                  square
FIGURE 4.12
Selection of data analysis methods based on the type of dependent variable.
138                                               Empirical Research in Software Engineering
     the input attributes be numerical. The machine learning technique that can han-
     dle heterogeneous data is DT. Thus, if our data is heterogeneous, then one may
     apply DT instead of other machine learning techniques (such as support vector
     machine, neural networks, and nearest neighbor methods).
  2. Redundancy in the data: There may be some independent variables that are redun-
     dant, that is, they are highly correlated with other independent variables. It is advis-
     able to remove such variables to reduce the number of dimensions in the data set.
     But still, sometimes it is found that the data contains the redundant information. In
     this case, the researcher should make careful selection of the data analysis methods,
     as some of the methods will give poor performance than others. For example, linear
     regression, LR, and distance-based methods, will give poor performance because of
     numerical instabilities. Thus, these methods should be avoided.
  3. Type and existence of interactions among variables: If each attribute makes an
     independent impact or contribution to the output or dependent variable, then
     the techniques based on linear functions (e.g., linear regression, LR, support vec-
     tor machines,  naïve Bayes) and distance functions (e.g.,  nearest neighbor meth-
     ods,  support vector machines with Gaussian kernels) perform well. But, if the
     interactions among the attributes are complex and huge, then DT and neural net-
     work should be used as these techniques are particularly composed to deal with
     these interactions.
  4. Size of the training set: Selection of appropriate method is based on the tradeoff
     between bias/variance. The main idea is to simultaneously minimize bias and
     variance. Models with high bias will result in underfitting (do not learn relation-
     ship between the dependent and independent variables), whereas models with
     high variance will result in overfitting (noise in the data). Therefore, a good learn-
     ing technique automatically adjusts the bias/variance trade-off based on the size
     of training data set. If the training set is small, high bias/low variance classifiers
     should be used over low bias/high variance classifiers. For example, naïve Bayes
     has a high bias/low variance (naïve Bayes is simple and assumes independence of
     variables) and k-nearest neighbor has a low bias/high variance. But as the size of
     training set increases, low bias/high variance classifiers show good performance
     (they have lower asymptotic error) as compared with high bias/low variance clas-
     sifiers. High bias classifiers (linear) are not powerful enough to provide accurate
     models.
    TABLE 4.10
    Data Analysis Methods Corresponding to Machine Learning Tasks
    S. No.   Machine Learning Tasks                     Data Analysis Methods
Besides the four above-mentioned important aspects, there are some other considerations
that help in making a decision to select the appropriate method. These considerations are
sensitivity to outliers, ability to handle missing values, ability to handle nonvector data,
ability to handle class imbalance, efficacy in high dimensions, and accuracy of class prob-
ability estimates. They should also be taken into account while choosing the best data
analysis method. The procedure for selection of appropriate learning technique is further
described in Section 7.4.3.
   The methods are classified into two categories: parametric and nonparametric. This
classification is made on the basis of the population under study. Parametric methods
are those for which the population is approximately normal, or can be approximated to
normal using a normal distribution. Parametric methods are commonly used in statistics
to model and analyze ordinal or nominal data with small sample sizes. The methods
are generally more interpretable, faster but less accurate, and more complex. Some of
the parametric methods include LR, linear regression, support vector machine, principal
component analysis, k-means, and so on. Whereas, nonparametric methods are those for
which the data has an unknown distribution and is not normal. Nonparametric meth-
ods are commonly used in statistics to model and analyze ordinal or nominal data with
small sample sizes. The data cannot even be approximated to normal if the sample size
is so small that one cannot apply the central limit theorem. Nowadays, the usage of non-
parametric methods is increasing for a number of reasons. The main reason is that the
researcher is not forced to make any assumptions about the population under study as is
done with a parametric method. Thus, many of the nonparametric methods are easy to
use and understand. These methods are generally simpler, less interpretable, and slower
but more accurate. Some of the nonparametric methods are DT, nearest neighbor, neural
network, random forest, and so on.
Exercises
  4.1. What are the different steps that should be followed while conducting experi-
     mental design?
  4.2. What is the difference between null and alternative hypothesis? What is the
     importance of stating the null hypothesis?
140                                             Empirical Research in Software Engineering
  4.3. Consider the claim that the average number of LOC in a large-sized software is
     at most 1,000 SLOC. Identify the null hypothesis and the alternative hypothesis
     for this claim.
  4.4. Discuss various experiment design types with examples.
  4.5. What is the importance of conducting an extensive literature survey?
  4.6. How will you decide which studies to include in a literature survey?
  4.7. What is the difference between a systematic literature review, and a more general
     literature review?
  4.8. What is a research problem? What is the necessity of defining a research problem?
  4.9. What are independent and dependent variables? Is there any relationship
     between them?
  4.10. What are the different data-collection strategies? How do they differ from one
     another?
  4.11. What are the different types of data that can be collected for empirical research?
     Why the access to industrial data is difficult?
  4.12. Based on what criteria can the researcher select the appropriate data analysis
     method?
Further Readings
The book provides a thorough and comprehensive overview of the literature review
process:
  A. Fink, Conducting Research Literature Reviews:  From the Internet to Paper. 2nd edn.
    Sage Publications, London, 2005.
  E. L. Lehmann, and J.P. Romano, Testing Statistical Hypothesis, 3rd edn., Springer,
     Berlin, Germany, 2008.
A classic paper provides techniques for collecting valid data that can be used for gathering
more information on development process and assess software methodologies:
One of the problems faced by the software engineering community is scarcity of data for
conducting empirical studies. However, the software repositories can be mined to col-
lect and gather the data that can be used for providing empirical results by validating
various techniques or methods. The empirical evidence gathered through analyzing the
data collected from the software repositories is considered to be the most important sup-
port for software engineering community these days. These evidences can allow software
researchers to establish well-formed and generalized theories. The data obtained from
software repositories can be used to answer a number of questions. Is design A better than
design B? Is process/method A better than process/method B? What is the probability of
occurrence of a defect or change in a module? Is the effort estimation process accurate?
What is the time taken to correct a bug? Is testing technique A better than testing tech-
nique B? Hence, the field of extracting data from software repositories is gaining impor-
tance in organizations across the globe and has a central and essential role in aiding and
improving the software engineering research and development practice.
   As already mentioned in Chapter 1 and 4 the data can either be collected from propri-
etary software, open source software (OSS), or university software. However, obtaining
data from proprietary software is extremely difficult as the companies are not usually
willing to share the source code and information related to the evolution of the software.
Another source for collecting empirical data is academic software developed by universi-
ties. However, collecting data from software developed by student programmers is not
recommended, as the accuracy and applicability of this data cannot be determined. In
addition, the university software is developed by inexperienced, small number of pro-
grammers and thus does not have applicability in the real-life scenarios.
   The rise in the popularity of the use of OSS has made vast amount of data available for use
in empirical research in the area of software engineering. The information from open source
repositories can be easily extracted in a well-structured manner. Hence, now researchers have
access to vast repositories containing large-sized software maintained over a period of time.
   In this chapter, the basic techniques and procedures for extracting data from software
repositories is provided. A detailed discussion on how change logs and bug reports are
organized and structured is presented. An overview of existing software engineering
repositories is also given. In this chapter, we present defect collection and reporting sys-
tem that can be used for collecting changes and defects from maintenance phase.
                                                                                          143
144                                              Empirical Research in Software Engineering
development life cycle. The artifacts (also known as deliverables) produced during the
software development life cycle include software requirement specification, software
design document, source code listings, user manuals, and so on (Bersoff et  al. 1980;
Babich 1986).
  A configuration management system also controls any changes incurred in these arti-
facts. Typically, configuration management consists of three activities: configuration
identification, configuration control, and configuration accounting (IEEE/ANSI Std.
1042–1987, IEEE 1987).
  • Release: The first issue of a software artifact is called a release. This usually
    provides most of the functionalities of a product, but may contain a large number
    of bugs and thus is prone to issue fixing and enhancements.
  • Versions: Significant changes incurred in the software project’s artifacts are called
    versions. Each version tends to enhance the functionalities of a product, or fix
    some critical bugs reported in the previous version. New functionalities may or
    may not be added.
  • Editions: Minor changes or revisions incurred in the software artifacts are termed
    as editions. As opposed to a version, an edition may not introduce significant
    enhancements or fix some critical issues reported in the previous version. Rather,
    small fixes and patches are introduced.
Change request
      Determine technical
                                               Analyze the impact and
      feasibility, costs, and
                                                plan for the change
             benefits
FIGURE 5.1
Change cycle.
 Change Request ID
 Type of Change Request               □ Enhancement                □ Defect Fixing                 □ Other (Specify)
 Project
 Requested By                         Project team member name
 Brief Description of the Change
                                      Description of the change being requested
  Request
 Date Submitted
 Date Required
 Priority                             □ Low                 □ Medium                 □ High              □ Mandatory
 Severity                             □ Trivial             □ Moderate               □ Serious           □ Critical
 Reason for Change                    Description of why the change is being requested
 Estimated Cost of Change             Estimates for the cost of incurring the change
 Other Artifacts Impacted             List other artifacts affected by this change
 Signature
FIGURE 5.2
Change request form.
146                                                       Empirical Research in Software Engineering
 Change Request ID
 Type of Change Request            □ Enhancement               □ Defect Fixing    □ Other (Specify)
 Project
 Module in which change is made
 Change Implemented by             Project team member name
 Date and time of change
  implementation
 Change Approved By                CCB member who approved the change
 Brief Description of the Change
                                   Description of the change incurred
  Request
 Decision                          □ Approved       □ Approved with Conditions   □ Rejected   □ Other
 Decision Date
 Conditions                        Conditions imposed by the CCB
 Approval Signature
FIGURE 5.3
Software change notice.
In the next section, we present the importance of mining information from software
repositories, that is, information gathered from historical data such as defect and
change logs.
Mining Data from Software Repositories                                                  147
FIGURE 5.4
Data analysis procedure after mining software repositories.
                                                      Software
                                                    repositories
FIGURE 5.5
Commonly used software repositories.
  For example, consider an application that consists of a module (say module 1) that takes
in some input and writes it to a data store, and another module (module 2) that reads the
data from that data store. If there is a modification in the source code of the module that
saves data to the data store, we may be required to perform changes to module 2 that
retrieves data from that data store, although there are no traditional dependencies (such
as control flow dependency) between the two modules. Such dependencies can be deter-
mined if and only if we analyze the historical data available for the software project. For
this example, data extracted from historical repositories will reveal that the two modules,
for saving the data to the data store and reading the data from that data store, are co-
changing, that is, a change in module 1 has resulted in a change in module 2.
  Historical repositories include source control repositories, bug repositories, and archived
communications.
      previous version of that software system. Even for a given version, several editions
      may be released to incorporate some minor changes in the software system.
   7. Application/domain: A software system usually serves a fundamental purpose or
      application, along with some optional or secondary features. Open source systems
      typically belong to one of these domains: graphics/media/3D, IDE, SDK, database,
      diagram/visualization, games, middleware, parsers/generators, programming
      language, testing, and general purpose tools that combine multiple such domains.
5.5.1 Introduction
In this section we provide classification of VCS. Each and every change, no matter how big
or small, is recorded over time so that we may recall specific revisions or versions of the
system artifacts later.
152                                                  Empirical Research in Software Engineering
The following general terms are associated with a VCS (Ball et al. 1997):
Employing a VCS also means that if we accidentally modify, damage, or even lose some
project artifacts, we can generally recover them easily by simply cloning or downloading
those artifacts from the VCS. Generally, this can be achieved with insignificant costs and
overheads.
                                         Branch 1
                                     Branch 2
                                                Baseline (original line of development)
Branch 3
FIGURE 5.6
Trunk and branches.
Mining Data from Software Repositories                                                   153
Local system
Versioning database
Revision 1
                         Project artifacts
                                                                Revision 2
Revision 3
FIGURE 5.7
Local version control.
                                                       Versioning database
                          Project
                          artifacts
                                                           Revision 1
                          Project
                                                           Revision 3
                          artifacts
FIGURE 5.8
Centralized version control.
  However, if the central server fails or the data stored at central server is corrupted or lost,
there are no chances of recovery unless we maintain periodic backups. Figure 5.8 presents
the concept of a CVCS.
Server
Versioning database
Revision 1
Revision 2
Revision 3
Client Client
                         Project                                    Project
                         artifacts                                  artifacts
Revision 1 Revision 1
Revision 2 Revision 2
Revision 3 Revision 3
FIGURE 5.9
Distributed version control systems.
generally referred to as known bugs. The information about a bug typically includes the
following:
  •   The time when the bug was reported in the software system
  •   Severity of the reported bug
  •   Behavior of the source program/module in which the bug was encountered
  •   Details on how to reproduce that bug
  •   Information about the person who reported that bug
  •   Developers who are possibly working to fix that bug, or will be assigned the job
      to do so
Many bug tracking systems also support tracking through the status of a bug to deter-
mine what is known as the concept of bug life cycle. Ideally, the administrators of a bug
tracking system are allowed to manipulate the bug information, such as determining
the possible values of bug status, and hence the bug life cycle states, configuring the
permissions based on bug status, changing the status of a bug, or even remove the
bug information from the database. Many systems also update the administrators and
developers associated with a bug through emails or other means, whenever new infor-
mation is added in the database corresponding to the bug, or when the status of the bug
changes.
  The primary advantage of a bug tracking system is that it provides a clear, concise,
and centralized overview of the bugs reported in any phase of the software develop-
ment life cycle, and their state. The information provided is valuable for defining the
product road map and plan of action, or even planning the next release of a software
system (Spolsky 2000).
  Bugzilla is one of the most widely used bug tracking systems. Several open source proj-
ects, including Mozilla, employ the Bugzilla repository.
                                                                                Source control
                         Software                         Defect
                                                                                  repository
                       repositories                     repositories
FIGURE 5.10
The procedure for defect/change data collection.
5.7.1 CVS
CVS is a popular CVCS that hosts a large number of OSS systems (Cederqvist et al. 1992).
CVS has been developed with the primary goal to handle different revisions of various
software project artifacts by storing the changes between two subsequent revisions of
these artifacts in the repository. Thus, CVS predominantly stores the change logs rather
than the actual artifacts such as binary files. It does not imply that CVS cannot store binary
files. It can, but they are not handled efficiently.
   The features provided by CVS are discussed below (http://cvs.savannah.gnu.org):
   Revision numbers: Each new revision or version of a project artifact stored in the CVS
     repository is assigned a unique revision number by the VCS itself. For example,
     the first version of a checked in artifact is assigned the revision number 1.1. After
     the artifacts are modified (updated) and the changes are committed (permanently
     recorded) to the CVS repository, the revision number of each modified artifact
     is incremented by one. Since some artifacts may be more affected by updation
     or changes than the others, the revision numbers of the artifacts are not unique.
     Therefore, a release of the software project, which is basically a snapshot of the
     CVS repository, comprises of all the artifacts under version control where the arti-
     facts can have individual revision numbers.
   Branching and merging: CVS supports almost all of the functionalities pertaining to
     branches in a VCS. The user can create his/her own branch for development, and
     view, modify, or delete a branch created by the user as well as other users, provided
     the user is authorized to access those branches in the repository. To create a new
     branch, CVS chooses the first unused even integer, starting with 2, and appends
     it to the artifacts’ revision number from where the branch is forked off, that is,
     the user who has created that branch wishes to work on those particular artifacts
     only. For example, the first branch, which is created at the revision number 1.2 of
158                                              Empirical Research in Software Engineering
    an artifact, receives the branch number 1.2.2 but CVS internally stores it as 1.2.0.2.
    However, the main issue with branches is that the detection of branch merges is
    not supported by CVS. Consequently, CVS does not boast of enough mechanisms
    that support tracking of evolution of typically large-sized software systems as well
    as their particular products.
  Version control data: For each artifact, which is under the repository’s version con-
    trol, CVS generates detailed version control data and saves it in a change log or
    simply log files. The recorded log information can be easily retrieved by using
    the CVS log command. Moreover, we can specify some additional parameters so
    as to allow the retrieval of information regarding a particular artifact or even the
    complete project directory.
Figure 5.11 depicts a sample change log file stored by the CVS. It shows the versioning data
for the source file “nsCSSFrameConstructor.cpp,” which is taken from the Mozilla project.
The CVS change log file typically comprises of several sections and each section presents
the version history of an artifact (source file in the given example). Different sections are
always separated by a single line of “=” characters.
  However, a major shortcoming of CVS that haunts most of the developers is the lack of
functionality to provide appropriate mechanisms for linking detailed modification reports
and classifying changes (Gall et al. 2003).
  The following attributes are recorded in the above commit record:
  • RCS file: This field contains the path information to identify an artifact in the
    repository.
  • Locks and AccessList: These are file content access and security options set by
    the developer during the time of committing the file with the CVS. These may be
    used to prevent unauthorized modification of the file and allow the users to only
    download certain file, but does not allow them to commit protected or locked files
    with the CVS repository.
  • Symbolic names: This field contains the revision numbers assigned to tag names.
    The assignment of revision numbers to the tag names is carried out individually
    for each artifact because the revision numbers might be different.
  • Description: This field contains the modification reports that describe the change
    history of the artifact, beginning from the first commit until the current version.
    Apart from the changes incurred in the head or main trunk, changes in all the
    branches are also recorded there. The revisions are separated by a few number of
    “-” characters.
  • Revision number: This field is used to identify the revision of source code artifact
    (main trunk, branch) that has been subject to change(s).
  • Date: This field records the date and time of the check in.
  • Author: This field provides the information of the person who committed the
    change.
  • State: This field provides information about the state of the committed artifact and
    generally assumes one of these values: “Exp” (experimental) and “dead” (file has
    been removed).
Mining Data from Software Repositories                                                    159
 RCS file:
 /cvsroot/mozilla/layout/html/style/src/nsCSSFrameConstructor.cpp,v
 Working file: nsCSSFrameConstructor.cpp
 head: 1.804
 branch:
 locks: strict
 access list:
 symbolic names:
    MOZILLA_1_3a_RELEASE: 1.800
    NETSCAPE_7_01_RTM_RELEASE: 1.727.2.17
    PHOENIX_0_5_RELEASE: 1.800
    ...
    RDF_19990305_BASE: 1.46
    RDF_19990305_BRANCH: 1.46.0.2
 keyword substitution: kv
 total revisions: 976; selected revisions: 976
 description:
 ----------------------------
 revision 1.804
 date: 2002/12/13 20:13:16; author: doe@netscape.com; state: Exp; lines: +15 - 47
 ....
 ----------------------------
 ....
 =============================================================
 RCS file:
 /cvsroot/mozilla/layout/html/style/src/nsCSSFrameConstructor.h,v
FIGURE 5.11
Example log file from Mozilla project at CVS.
   • Lines: This field counts the lines added and/or deleted of the newly checked in
     revision compared with the previous version of a file. If the current revision is
     also a branch point, a list of branches derived from this revision is listed in the
     branches field. In the above example, the branches field is blank, indicating that the
     current revision is not a branch point.
   • Free Text: This field provides the comments entered by the author while commit-
     ting the artifact.
5.7.2 SVN
SVN is a commonly employed CVCS provided by the Apache organization that hosts a
large number of OSS systems, such as Tomcat and other Apache projects. It is also free and
open source VCS.
160                                                Empirical Research in Software Engineering
  Being a CVCS, SVN has the capability to operate across various networks, because of
which people working on different locations and devices can use SVN. Similar to other
VCS, SVN also conceptualizes and implements a version control database or repository in
the same manner. However, different from a working copy, a SVN repository can be con-
sidered as an abstract entity, which has the ability to be accessed and operated upon almost
exclusively by employing the tools and libraries, such as the Tortoise-SVN.
  The features provided by SVN are discussed below:
  Revision numbers: Each revision of a project artifact stored in the SVN repository is
    assigned a unique natural number, which is one more than the number assigned to
    the previous revision. This functionality is similar to that of CVS. The initial revi-
    sion of a newly created repository is typically assigned the number “0,” indicating
    that it consists of nothing other than an empty trunk or main directory. Unlike
    most of the VCS (including CVS), the revision numbers assigned by SVN apply
    to the entire repository tree of a project, not the individual project artifacts. Each
    revision number represents an entire tree, or a specific state of the repository after
    a change is committed. In other words, revision “i” means the state of the SVN
    repository after the “ith” commit. Since some artifacts may be more affected by
    updation or changes than the others, it implies that the two revisions of a single
    file may be the same, since even if one file is changed the revision number of each
    and every artifact is incremented by one. Therefore, every artifact has the same
    revision number for a given version of the entire project.
  Branching and merging: SVN fully provides the developers with various options to
    maintain parallel branches of their project artifacts and directories. It permits them
    to create branches by simply replicating or copying their data, and remembers
    that the copies which are created are related among themselves. It also supports
    the duplication of changes from a given branch to another. SVN’s repository is
    specially calibrated to support efficient branching. When we duplicate or copy
    any directory to create a branch, we need not worry that the entire SVN repository
    will grow in size. Instead, SVN does not copy any data in reality. It simply creates
    a new directory entry, pointing to an existing tree in the repository. Owing to this
    mechanism, branches in the SVN exist as normal directories. This is opposed to
    many of the other VCS, where branches are typically identified by some specific
    “labels” or identifiers to the concerned artifacts.
SVN also supports the merging of different branches. As an advantage over CVS, SVN
1.5 had incorporated the feature of merge tracking to SVN. In the absence of this feature,
a great deal of manual effort and the application of external tools were required to keep
track of merges.
   Version control data: This functionality is similar to CVS. For each artifact, which is under
version control in the repository, SVN also generates detailed version control data and stores
it to change log or simply log files. The recorded log information can be easily retrieved
by using a SVN client, such as Tortoise-SVN client, and also by the “svn log” command.
Moreover, we can also specify some additional parameters so as to allow the retrieval of
information regarding a particular artifact or even the complete project directory.
   Figure 5.12 depicts a sample change log file stored by the SVN repository. It presents the
versioning data for the source file “mbeans-descriptors.dtd” of the Apache’s Tomcat project.
   Although the SVN classifies changes to the files as modified, added, or deleted, there
are no other classification types for the incurred changes that are directly provided by it,
Mining Data from Software Repositories                                                    161
=============================================================
Revision: 1561635
Actions: Modified
Author: kkolinko
 Modified:
 /tomcat/trunk/java/org/apache/tomcat/util/modeler/mbeans-descriptors.dtd
Added: Nil
Deleted: Nil
Message:
 Followup to r1561083
 Remove svn:mime-type property from *.dtd files.
 The value was application/xml-dtd.
=============================================================
FIGURE 5.12
Example log file from Apache Tomcat project at SVN.
such as classifying changes for enhancement, bug-fixing, and so on. Even though we have
a “Bugzilla-ID” field, it is still optional and the developer committing the change is not
bound to specify it, even if he has fixed a bug already reported in the Bugzilla database.
  The following attributes are recorded in the above commit record:
   • Revision number: This field identifies the source code revision (main trunk,
     branch) that has been modified.
   • Actions: This field specifies the type of operation(s) performed with the file(s)
     being changed in the current commit. Possible values include “Modified” (if a
     file has been changed), “Deleted” (if a file has been deleted), “Added” (if a file has
     been added), and a combination of these values is also possible, in case there are
     multiple files affected in the current commit.
   • Author: This field identifies the person who did the check in.
   • Date: Date and time of the check in, that is, permanently recording changes with
     the SVN, are recorded in the date field.
   • Bugzilla ID (optional): This field contains the ID of a bug (if the current commit
     fixes a bug) that has also been reported in the Bugzilla database. If specified, then
     this field may be used to link the two repositories: SVN and Bugzilla, together.
162                                             Empirical Research in Software Engineering
    We may obtain change logs from the SVN (through version control data) and bug
    details from the Bugzilla.
  • Modified: This field lists the source code files that were modified in the current
    commit. In the above log file, the file “mbeans-descriptors.dtd” was modified.
  • Added: This field lists the source code files that were added to the project in the
    current commit. In the above log file, this field is not specified, indicating that no
    files have been added.
  • Message: The following message field contains informal data entered by the author
    during the check in process.
5.7.3 Git
Git is a popular DVCS, which is being increasingly employed by a large number of soft-
ware organizations and software repositories throughout the world. For instance, Google
hosts maintains the source control data for a large number of its software projects through
Git, including the Android operating system, Chrome browser, Chromium OS, and many
more (http://git-scm.com).
  Git stores and accesses the data as content addressable file systems. It implies that a
simple Hash-Map, or a key–value pair is the fundamental concept of Git’s data storage and
access mechanism. The following terms are specific to Git and are an integral part of Git’s
data storage mechanism:
                                               Git object
                                               SHA: String
                                               Type: String
                                              Size: Integer
                                              Content: Byte
                                 Tree                               Tag
                              Name: String                      Name: String
                   Subtrees   Type: “Tree”                      Type: “Tag”
                                  Blob
                                                                   Commit
                              Name: String
                                                               Message: String
                              Type: “Blob”
                                                               Type: “Commit”
                              Mode: Integer
FIGURE 5.13
Git storage structure.
   • Tag: It contains a reference to another Git object and may also hold some metadata
     of another object. The fields of a Tag object are as follows:
      • Name of the Tag object (string)
      • Type of the Tag object, having the fixed value as “Tag”
Figure 5.13 depicts the data structure or data model of Git VCS. Figure 5.14 presents an
example of how data is stored by Git.
  The tree has three edges, which correspond to different file directories. The first two
edges point to blob objects, which store the actual file content. The third edge points to
another tree or file directory, which stores the file “simplegit.rb” in the blob object.
  However, Git visualizes and stores the information much differently than the other VCS,
such as CVS, even though it provides a similar user interface. The important differences
between Git and CVS are highlighted below:
   Revision numbers: Similar to CVS, each new version of a file stored in the Git reposi-
     tory receives a unique revision number (e.g., 1.1 is assigned to the first version of
     a committed file) and after the commit operation, the revision number of each
     modified file is incremented by one. But in contrast to CVS, and many other VCS,
     that store the change-set (i.e., changes between subsequent versions), Git thinks of
     its data more like a set of snapshots of a mini file system. Every time a user per-
     forms a commit and saves the state of his/her project with Git, Git simply captures
     a snapshot of what all the files look like at that particular moment of committing,
     and then reference to that snapshot is stored. For efficiency, Git simply stores the
     link to the previous file, if the files in current and previous commit are identical.
   Local operations: In Git, most of the operations are done using files on client machine,
     that is, local disk. For example, if we want to know the changes between current
     version and version created few months back. Git does local calculation by looking
     up the differences between current version and previous version instead of getting
164                                                        Empirical Research in Software Engineering
Root directory
Tree 1
                                                                Su
                                                                  bd
                                     File 1   File 2                 ire
                                                                        cto
                                                                            ry
                                                                                 Tree 2
                            Blob 1               Blob 2
File 3
Blob 3
FIGURE 5.14
Example storage structure at Git.
     information from remote server or downloading previous version from the remote
     server. Thus, the user feels the increase in speed as the network latency overhead
     will be reduced. Further, lots of work can be done offline.
   Branching and merging: Git also provides its users to exploit its branching capabili-
     ties easily and efficiently. All the basic operations on a branch, such as creation,
     cloning, modification, and so on, are fully supported by Git. In CVS, the main
     issue with branches is that CVS does not support detection of branch merges.
     However, Git determines to use for its merge base, the best common ancestor. This
     is contrary to CVS, wherein the developer performing the merging has to figure
     out the best merge base himself. Thus, merging is much easier in Git.
   Version control data: Similar to CVS, for each working file, Git generates version con-
     trol data and saves it in log files. From here, the log file and its metadata can be
     retrieved by using the “git log” command. The specification of additional param-
     eters allows the retrieval of information regarding a given file or a complete
     directory. Additionally, the log command can also be used with a “stat” flag to
     indicate the number of LOC changes incurred in each affected file after a commit
     is issued. Figure 5.15 presents an example of Git log file for Apache’s log4j applica-
     tion (https://apache.googlesource.com/log4j).
In addition to the above differences, Git also maintains integrity (no change can be made
without the knowledge of Git) and, generally, Git only adds data. The following attributes
are recorded in the above commit record:
   • Commit: Indicates the check-sum for this commit. In Git, everything is check-
      summed prior to being stored and is then referenced by using that checksum.
Mining Data from Software Repositories                                                    165
 commit 1b0331d1901d63ac65efb200b0e19d7aa4eb2b8b
 Author: duho.ro <duho.ro@lge.com>
 Date:   Thu Jul 11 09:32:18 2013 + 0900
Bug: 9767739
FIGURE 5.15
Example log file for Android at Git.
     It implies that it is impossible to modify the contents of any artifact file or even
     directory without the knowledge of Git.
   • Date: This field records the date and time of the check in.
   • Author: The author field provides the information of the person who committed
     the change.
Free text: This field provides informal data or comments given by the author during the
commit. This field is of prime importance in extracting information for areas such as
defect prediction, wherein bug or issue IDs are required to identify a defect, and these can
be obtained after effectively processing this field. Following the free text, we have the list
of files that have been changed in this commit. The name of a changed file is followed by
a number which indicates the total number of LOC changes incurred in that file, which
in turn is followed by the number of LOC insertions (the count of occurrences of “+”) and
LOC deletions (the count of occurrences of “–”). However, a modified LOC is treated as a
line that is first deleted (–) and then inserted (+) after modifying. The last line summarizes
the total number of files changed, along with total LOC changes (insertions and deletions).
  Table 5.1 compares the following freely available features of the software repositories.
These repositories can be mined to obtain useful information for analysis.
  TABLE 5.1
  Comparison of Characteristics of CVS, SVN, and Git Repositories

  Repository Characteristic   CVS                     SVN                              Git
  Initial release             July 3, 1986            October 20, 2000                 April 3, 2005
  Development language        C                       C                                C, Perl, Shell Script
  Maintained by               CVS Team                Apache                           Junio Hamano
  Repository type             Client-server           Client-server                    Distributed
  License type                GNU GPL (open source)   Apache/BSD-style (open source)   GNU GPL v2 (open source)
  Platforms supported         Windows, Unix, OS X     Windows, Unix, OS X              Windows, Unix, OS X
  Revision IDs                Numbers                 Numbers                          SHA-1 hashes
  Speed                       Medium                  High                             Excellent
  Ease of deployment          Good                    Medium                           Good
  Repository replication      Indirect                Indirect                         Yes (git clone)
  Example                     Firefox                 Apache, FreeBSD, SourceForge,    Chrome, Android, Linux,
                                                      Google Code                      Ruby, OpenOffice
  • Repository type: The type of relationship that the various copies of the repository
    have with each other. In the client-server model, the master copy resides on the
    server and the clients access it; in the distributed model, every user has a full
    local repository.
  • License type: The license under which the software is distributed.
  • Platforms supported: The operating systems supported by the VCS.
  • Revision IDs: The unique identifiers used to identify releases and versions of the software.
  • Speed: The speed of the software.
  • Ease of deployment: How easily the system can be deployed.
  • Repository replication: How easily the repository can be replicated.
  • Example: Names of a few popular software systems that use the specified VCS.
5.7.4 Bugzilla
Bugzilla is a popular bug tracking system, which provides access to bug reports for a large
number of OSS systems. The Bugzilla database can easily be accessed through HTTP, and the
defect reports can be retrieved from the database in XML format. Bug reports thus
obtained can aid managers and developers in identifying defect-prone modules or files in
the source code, which are candidates for redesign or reimplementation and hence deserve
closer and more careful analysis (http://www.bugzilla.org).
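As a minimal sketch of this HTTP-based access, the following Java snippet downloads a single bug report in XML form through Bugzilla's show_bug.cgi page with the ctype=xml parameter; the Bugzilla instance and the bug ID used here are assumptions chosen only for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FetchBugXml {
    public static void main(String[] args) throws Exception {
        // Assumed Bugzilla instance and bug ID, for illustration only.
        String bugzilla = "https://bugzilla.mozilla.org/show_bug.cgi";
        int bugId = 100000;

        // Appending ctype=xml asks Bugzilla to return the bug report as XML.
        URL url = new URL(bugzilla + "?ctype=xml&id=" + bugId);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // raw XML; feed it to any XML parser as needed
            }
        }
    }
}

The XML obtained in this way can then be parsed to extract fields such as the bug ID, status, and product, which are discussed below.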
   Additionally, contact information, mailing addresses, discussions, and other adminis-
trative information are also provided by Bugzilla. Interesting patterns and information
for an evolutionary view of a software system, such as bug severity and the affected
artifacts, products, or components, may also be obtained from this bug database.
Figure 5.16 depicts the schema diagram of the Bugzilla database (referenced from http://
bugzilla.org).
[Figure 5.16 (below) is an entity-relationship diagram of the Bugzilla database. Its tables include bugs, profiles, products, components, versions, milestones, attachments, attachstatuses, attachstatusdefs, bugs_activity, longdescs, cc, votes, keywords, keyworddefs, dependencies, duplicates, groups, fielddefs, profiles_activity, namedqueries, tokens, logincookies, watch, and shadowlog, linked by keys such as bugs.bug_id and profiles.userid.]
FIGURE 5.16
Bugzilla database schema.
[Figure 5.17 (below) shows the bug states New/Unconfirmed, Resolved, Reopened, Verified, and Closed, with transitions such as: an unsatisfactory solution reopens a resolved bug; a verified fix moves a resolved bug to Verified; and a verified or resolved bug is eventually closed, from where it may again be reopened.]
FIGURE 5.17
Life cycle of a bug at Bugzilla (http://www.bugzilla.org).
  Figure 5.17 depicts the life cycle stages of a bug in the context of Bugzilla (version 3.0).
Figure 5.18 presents a sample XML report that provides information regarding a defect
reported in the Browser product of the Mozilla project (http://www.bugzilla.
mozilla.org).
  The following fields are contained in the bug report record shown in Figure 5.18:
   • Bug ID: This is the unique ID assigned to each bug reported in the software
     system.
   • Bug status: This field contains information about the current status or state of the
     bug. Some of the possible values include unconfirmed, assigned, resolved, and so on.
     The status whiteboard can be used to add notes and tags regarding a bug.
   • Product: This field indicates the product of the software project that is affected by
     the bug. For the Mozilla project, some products are Browser, MailNews, NSPR, Phoenix,
     Chimera, and so on.
FIGURE 5.18
Sample bug report from Bugzilla.
Although a VCS and a bug-tracking system serve different purposes, their capabilities are
such that they complement one another really well.
  We know that a VCS maintains, in its repository, the required information regarding each
and every change incurred in the source code of a software project under versioning.
Through the change logs maintained by the VCS, we may come across certain changes
that were made for bug fixing. So, if a VCS can also provide bug information, what is the
need of a bug-tracking system such as Bugzilla? As discussed in the previous section, we
can obtain detailed information about a bug from Bugzilla, such as the bug's life cycle,
its current state, and so on; this information cannot be obtained from a VCS.
  Similarly, a bug repository neither provides information regarding changes made for
purposes other than defect fixing, nor does it store the different versions and details
of a project (and its artifacts) that are maintained in the VCS.
  Therefore, some organizations typically employ both a VCS and a bug-tracking system
to serve the dual purpose of versioning and defect data management. For instance, the
Mozilla open source project is subject to versioning under the CVS repository, while the
bugs for that project are reported in the Bugzilla database. For such a project, we may
obtain the bug IDs from the change logs of the VCS and then link or map these bug IDs to
the ones stored in the Bugzilla database. We may then obtain detailed information about
both the committed changes and the bugs that were fixed by these changes. As stated in
Section 5.7.2, the SVN change logs contain an optional Bugzilla ID field, which may be
employed to link the change log information provided by SVN to the bug information in the
Bugzilla database.
The [remote repository URL] indicates the URL of a Git repository and must be specified.
The [exact branch name] of the main trunk needs to be specified if we wish to clone a spe-
cific branch/version of the repository; if the branch is not specified, then the trunk (i.e.,
the latest version) will be cloned. The [destination path of end-user machine] may be
specified if the user wishes to download the repository to a specific location on his machine.
   For example, a clone command can download the repository for the Android OS "Contacts"
application, for the branch "android_2.3.2_r1," to the destination "My Documents."
5.8.2 Metrics
Predominantly, the source code of a software system has been employed in the past to
gather software metrics, which are, in turn, employed in various research areas, such as
defect prediction, change proneness, evaluating the quality of a software system, and
many more (Hassan 2008).
  To validate the impact of object-oriented (OO) metrics on defect and change proneness,
various studies have been conducted in the past with varied sets of OO metrics. These
studies show that the Chidamber and Kemerer (CK) metric suite remains the most popular
metric suite in the literature.
  Studies carried out by various researchers (Basili et al. 1996; Tang et al. 1999; Briand et al.
2000a; El Emam et al. 2001a; Yu et al. 2002; Gyimothy et al. 2005; Olague et al. 2007; Elish
et  al. 2011; Malhotra and Jain 2012) show that OO metrics have a significant impact on
defect proneness. Several studies have also been carried out to validate the impact of OO
metrics on change proneness (Chaumum et al. 1999; Bieman et al. 2003; Han et al. 2010;
Ambros et al. 2009; Zhou et al. 2009; Malhotra and Khanna 2013). These also reveal that OO
metrics have a significant impact on change proneness.
  However, most of these studies relied on metric data that was made publicly available,
or obtained the data manually, which is a time-consuming and error-prone process.
  In contrast, in their study investigating the relationship between OO metrics and change
proneness, Malhotra and Khanna (2013) effectively analyzed the source code of software
repositories (Frinika, FreeMind, and OrDrumbox) to calculate software metrics in an
automated and efficient manner. They gathered the source code for two versions of each of
the considered software systems and then, using the Understand for Java tool
(http://www.scitools.com/), collected the metrics for the previous version (Frinika 0.2.0,
FreeMind 0.9.0 RC1, OrDrumbox 0.8.2004) of each system. Various OO and size metrics were
collected and analyzed, including the CK metrics. Understand for Java reports metrics at
various levels of detail, such as files, methods, and classes. Thus, metrics were collected
for all the classes in the software systems, and changes in the classes were assessed and
predicted. The Understand software also provides metrics for "unknown" classes; these
values must be discarded, as such classes cannot be accessed.
  Malhotra and Jain (2012) also carried out a study to propose a defect prediction
model for OSS systems, wherein the focus was on the applicability of OO metrics in pre-
dicting defect-prone classes. The metrics were collected using the CKJM tool. CKJM is an
open source application that calculates metrics for various suites, including the CK
metrics and the quality metrics for OO design (QMOOD). It operates on the Java
byte code files (i.e., .class files), which can be obtained from the source code of OSS systems
hosted at various repositories.
  These studies are good examples of how source code obtained from software repositories
may be analyzed effectively, and they add value to the field of software repository
mining. Tools such as Understand and CKJM, to a certain extent, demonstrate the importance
of analyzing the source code obtained from software repositories and its application in
popular research areas.
  We now discuss some of the tools that generate OO metric data for a given software
project.
  1. Understand
     It is a proprietary and paid application developed by SciTools (http://www.sci-
     tools.com). It is a static code analysis software tool and is mainly employed for
     purposes such as reverse engineering, automatic documentation, and calcula-
     tion of source code metrics for software projects with large size or code bases.
      Understand functions through an integrated development environment (IDE), which is
      designed to aid in maintaining old code and understanding new code by employing
      detailed cross-references and a wide range of graphical views. Understand supports a
      large number of programming languages, including Ada, ANSI C, C++, C#, Cobol, Delphi,
      Fortran, Java, JavaScript, JOVIAL, PHP, Python, the style sheet language CSS, HTML,
      and the hardware description language VHDL. The calculated metrics include
      complexity metrics (such as
     McCabe’s CC), size and volume metrics (such as LOC), and other OO metrics (such
     as depth of inheritance tree [DIT] and coupling between object classes [CBO]).
  2. CKJM Calculator
     CKJM is an open source application written in the Java programming language
     (http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/metric.html). It is intended to calcu-
     late a total of 19 OO metrics for systems developed using Java. It supports vari-
     ous OO metrics, including coupling metrics (CBO, RFC, etc.), cohesion metrics
     (lack of cohesion in methods [LCOM] of a class, cohesion among methods of a
     class, etc.), inheritance metrics (DIT, number of children [NOC], etc.), size metrics
     (LOC), complexity metrics (McCabe’s CC, average CC, etc.), and data abstraction
      metrics. The tool operates on the compiled code of an application, that is, on the
      byte code (.class) files, and calculates the different metrics from them. By default,
      it simply writes the output report to the command line, but it can easily be embedded
      in another application to generate metric reports in the desired format.
  3. Source Monitor
     It is a freeware application written in C++ and can be used to measure metrics for
     software projects written in C++, C, C#, VB.NET, Java, Delphi, Visual Basic (VB6),
     or HTML (www.campwoodsw.com/sourcemonitor.html). This tool collects met-
     rics in a fast, single pass through the source files. The user can print the metrics
     in tables and charts, and even export metrics to XML or CSV files for further
      processing. Source Monitor supports various OO metrics, predominantly code metrics.
      Some of the code metrics provided by Source Monitor include percent branch
      statements, methods per class, average statements per method, and maximum method or
      function complexity.
  4.   NDepend
        It is a proprietary application developed using the .NET Framework that can perform
        various tasks for systems written in .NET, including the generation of 82 code and
        quality metrics, trend monitoring for these metrics, exploration of the code
        structure, detection of dependencies, and many more (www.ndepend.com). Currently, it provides
       12 metrics on application (such as number of methods, LOC, etc.), 17 metrics on
       assemblies (LOC and other coupling metrics), 12 metrics on namespaces (such as
       afferent coupling and efferent coupling at the namespace level), 22 metrics on type
       (such as NOC and LCOM), 19 metrics on methods (coupling and size metrics), and
       two metrics on fields (size of instance and afferent coupling at the field level). It
        can also be integrated with Visual Studio and performs a lightweight and fast
        analysis to generate metrics, which makes it useful for real-world applications.
  5.   Vil
       It is a freeware application that provides different functionalities, including met-
       rics, visualization, querying, and analysis of the different components of applica-
       tions developed in .NET (www.1bot.com). The .NET components supported are
       assemblies, classes, and methods. It works for all of the .NET languages, including
       C# and Visual Basic.NET. It provides a large (and growing) suite of metrics per-
       taining to different entities such as classes, methods, events, parameters, fields,
       try/catch blocks, and so on, reported at multiple levels. Vil also supports various
       class cohesion, complexity, inheritance, coupling dependencies, and data abstrac-
        tion metrics. A few of these metrics are CC, LCOM, CBO, instability, distance, and
        afferent and efferent couplings.
  6.   Eclipse Metrics Plugin
       It is an open source Eclipse plugin that calculates various metrics for code written
       in Java language during build cycles (eclipse-metrics.sourceforge.net). Currently,
       the supported metrics include McCabe’s CC, LCOM, LOC in method, number of
       fields, number of parameters, number of levels, number of locals in scope, efferent
       couplings, number of statements, and weighted methods per class. It also warns
       the user of “range violations” for each of the calculated metric. This enables the
       developer to stay aware of the quality of the code he has written. The developer
       may also export the metrics to HTML web page or to CSV or XML file formats for
       further analysis.
  7.   SonarQube
       It is an open source application written in the Java programming language (www.
        sonarqube.org). Although mostly employed for Java-based applications, it also
        supports various other languages, such as C, C++, PHP, COBOL, and many more.
        It also offers the ability to add one's own rules for these languages. SonarQube
        provides various OO metrics, including complexity metrics (class complexity, method
        complexity, etc.), design metrics (RFC, package cycles, etc.), documentation
        metrics (comment lines, comments in procedure divisions, etc.), duplication metrics
       (duplicated lines, duplicated files, etc.), issues metrics (total issues, open issues,
       etc.), size metrics (LOC, number of classes, etc.), and test metrics (branch coverage,
       total coverage, etc.).
   8. Code Analyzer
       It is an open source tool written in the Java programming language, intended for
       applications developed in the C, C++, Java, Assembly, and HTML languages
       (sourceforge.net/projects/codeanalyze-gpl). It calculates the OO metrics across multiple
      source trees of a software project. It offers flexible report capabilities and a nice
      tree-like view of software projects being analyzed. The metrics calculated include
      Total Files (for multiple file metrics), Code Lines/File (for multiple file metrics),
      Comment Lines/File (for multiple file metrics), Comment Lines, Whitespace Lines,
      Total Lines, LOC, Average Line Length, Code/Comments ratio, Code/Whitespace
      ratio, and Code/(Comments + Whitespace) ratio. In addition to the predefined
      metrics, it also supports user-defined software source metrics.
   9. Pylint
       It is an open source application developed using the Python programming language,
       intended for analyzing the source code of applications written in Python; it looks
       for defects or bugs and reveals possible signs of poor quality (www.pylint.org).
       Pylint displays a number of messages to the user as it
      analyzes the Python source code, as well as some statistics regarding the warn-
      ings and errors found in different files. The messages displayed are generally
      classified into different categories such as errors and warnings. Different metrics
      are generated on the basis of these statistics. The metrics report displays summa-
      ries gathered from the source code analysis. The details include: a list of external
      dependencies found in the code; where they appear; number of processed mod-
      ules; the total number of errors and warnings for each module; the percentage
      of errors and warnings; percentage of classes, functions, and modules with doc-
      strings; and so on.
  10. phpUnderControl and PHPUnit
       phpUnderControl is an add-on tool for the well-known continuous integration
       tool CruiseControl (http://phpUnderControl.org). It is an open source
      application written in Java and PHP languages, which aims at integrating some
      of the best PHP development tools available, including testing and software met-
      rics calculator tools. PHPUnit is a tool that provides a framework for automated
      software tests and generation of various software metrics (http://phpunit.de). The
      software predominantly generates a list of various code, coverage, and test met-
      rics, such as unit coverage, LOC, test to code ratio, and so on. The reports are gen-
      erated in XML format, but phpUnderControl comes with a set of XSL style sheets
      that can format the output for further analysis.
Table 5.2 summarizes the features of the above-stated source code analysis tools.
TABLE 5.2
Summary of Source Code Analysis Tools
Tool        Availability        Source Language        Programming Language(s) Supported        Metrics Provided
repository analysis for any software system. Figure 5.19 gives an overview of the various
applications of software historical analysis. The applications are explained in the subsec-
tions presented below.
[Figure 5.19 (below) shows mining repositories (applications) at the center, surrounded by its applications: defect prediction, change prediction, trend analysis, pattern mining, change impact analysis, design correction, effort estimation, and measures/metrics.]
FIGURE 5.19
Applications of data mined from software repositories.
  Thus, using the historical data, many unexpected dependencies can be easily revealed,
explained, and rationalized.
Change proneness is also very crucial and needs to be evaluated accurately. Prediction of change-prone classes
may aid in maintenance and testing. A class that is highly probable to change in the later
releases of a software needs to be tested rigorously, and proper tracking is required for that
class, while modifying and maintaining the software (Malhotra and Khanna 2013).
  Therefore, various studies have been carried out in the past to develop effective
change-proneness prediction models and to validate the impact of OO metrics on change proneness.
Historical analysis may be effectively employed for change-proneness studies. Similar to
defect prediction studies, researchers have also adopted various novel techniques that
analyze vast and varied data regarding a software system, which is available through
historical repositories to discover probable changes in a software system.
5.10.1 FLOSSmole
Its former name was OSSmole. FLOSSmole is a collaboratively designed project to gather,
share, and store comparable data for the analysis of free and open source software
development for the purpose of academic research (http://flossmole.org). FLOSSmole
maintains data and results about FLOSS projects that have not been developed in a
centralized manner.
   The FLOSSmole repository provides data that includes source code, project meta-
data and characteristics (e.g., programming languages, platform, target audience, etc.),
developer-oriented information, and issue-tracking data for various software systems.
   The purpose of FLOSSmole is to provide widely used data sets of high quality, and shar-
ing of standard analyses for validation, replication, and extension. The project contains the
results of collective efforts of many research groups.
TABLE 5.3
Summary of Software Repositories and Data Sets

Repository        Web Link        Source        Data Format        Sources        Public
5.10.2 FLOSSMetrics
FLOSSMetrics is a research project funded by the European Commission Sixth Framework
Program. The primary goal of FLOSSMetrics is to build, publish, and analyze a large-scale
database of software projects, and to retrieve information and metrics regarding libre
software development using pre-existing methodologies and tools.
The project also provides its users with a public platform for validating and industrially
exploiting the obtained results.
  As of now, four types of repository metrics are offered: source code management infor-
mation, code metrics (only for files written in C), mailing lists (archived communications),
and bug-tracking system details. The project, while focusing on the software project
development itself, also provides valuable information regarding the actors or developers,
the project artifacts and source code, and the software processes.
  FLOSSMetrics is currently in its final stage, and some of the results and databases/
data sets are already available. The FLOSS community has proven to be a
great opportunity for enhancing and stimulating empirical research in the scope of soft-
ware engineering. Additionally, there are thousands of projects available in that commu-
nity, and a majority of them provide their source code and artifact repositories to everyone
for any application.
  Sourcerer’s managed repository stores and maintains the local copies of software
projects that have been garnered from numerous open source repositories. As of now, the
repository hosts as many as 18,000 Java-based projects obtained from Apache, Java.net,
Google Code, and SourceForge.
  Additionally, the project provides Sourcerer DB, which is a relational database whose
structure and reference information are extracted from the projects’ source code. Moreover,
a code search engine has also been developed using the Sourcerer infrastructure.
  • Bugs search, that is, multicriteria search engine for information related to bugs.
  • Debian maintainer dashboard that provides information regarding the Debian
    project and its development.
  • Bugs usertags, which allow the users to search for user-specified tag on bugs.
  • Sponsors stats, which provides some statistics regarding who is sponsoring
    uploads to Debian.
  • Bapase, which allows the users to look for different packages using various criteria.
  • Apply a prediction or modeling technique based on the source code metrics, his-
    torical measures, or software process information (obtained from the CVS log
    data, etc.)
  • Evaluate the performance of the prediction technique by comparing the results
    obtained with the actual number of postrelease bugs reported in a bug-tracking
    system
The repository has been designed to perform bug prediction in a given software system
at the class level. However, we can also derive the package or subsystem information by
aggregating the data for each class, because with each class, the package that contains it
is specified. For each system hosted at the repository, the data set includes the following
information:
test suites, defect data, and so on. The repository also maintains documentation on how
to employ these artifacts for experimentation, methodologies that facilitate
experimentation, supporting tools that aid these processes, and useful information
regarding the processes used to maintain and enhance the artifacts.
  The SIR repository data is freely made available to the users after they register with the
project community and agree to the terms specified in SIR license.
5.10.11 Ohloh
Ohloh is a free, public directory of free and open source software (FOSS) and of the
members and contributors who develop and maintain it (http://www.ohloh.net). Ohloh's own
source code and repository are publicly available. It also provides a free code search
site that indexes most of the projects listed on Ohloh. Ohloh can be edited or modified by
everyone, just like a wiki. Anyone can join and
add new projects, and even make modifications to the existing project pages. Such public
reviews have helped to make Ohloh one of the biggest, most accurate, and up-to-date FOSS
directories available.
  Ohloh does not host software projects and source code. Instead, it is a community, a
directory, and an analytics and search service. Ohloh generates reports regarding the
composition and activity of a software project's source code by connecting to the
corresponding source code repositories, analyzing the source code's history and the
updates currently being made, and attributing the updates to their respective
contributors. It also aggregates these data to track the changing nature of the FOSS world.
  Additionally, Ohloh provides various tools and methodologies for comparing projects,
languages, repositories, and analyzing language statistics. Popular projects accessible
from Ohloh include Google Chrome, Mozilla, WebKit, MySQL, Python, OpenGL, and
many more. Ohloh is owned and operated by Black Duck software.
  • Variation in project development team size over time, that is, the number of devel-
    opers as a function of time.
  • Participation of developers on projects, that is, number of projects in which
    individual developers participate.
  • The above two measures are used to form what is known as a “collaboration social-
    network.” This is used to obtain scale-free distributions among project activity
    and developer activity.
  • The extended-community size for each project, which includes the number of
    project developers along with the registered members who have participated in
    any way in the development life cycle of a project, such as discussions on forums,
    bug reporting, patch submission, and so on.
  • Date of creation at SourceForge.net for each software project.
  • Date of the first release for each software project.
  • Ranking of the projects as per SourceForge.net.
  • Category-wise distribution of projects, for example, databases, games, communi-
    cations, security, and so on.
5.10.14 Tukutuku
The objectives of the Tukutuku benchmarking project are: first, data gathering on web
projects that will be used to build company-specific or generic cost estimation models that
will enable a web company to enhance its current cost estimation practices; and second, to
enable a web company to benchmark its productivity within and across web companies
(http://www.metriq.biz/tukutuku).
  To date, the Tukutuku benchmarking project has gathered data on 169 web projects
worldwide, and these data have been used to help several web companies.
It can publish factual information from any source code or version control system.
Nevertheless, the first release covers only Java code and SVN; the second release is
expected to cover C#, C++, CVS, and Git.
  The information contained in it has been extracted from source code, versioning, and
bug/issue systems. The information pieces have been interconnected explicitly. The data
has been extracted from approximately 18,000 open source projects with as much as
1,500,000 files and nearly 400,000,000 LOC. It is a multipurpose project. Its applications
include mostly software research, software documentation/traceability, and enhancing
the future of software development.
FIGURE 5.20
Defect collection and reporting system.
We first had to determine which kinds of software repositories are suitable for extracting
such data. After a rigorous study and analysis, we chose those software repositories that
employ "Git" as the VCS (http://git-scm.com). The reasons for selecting Git-based software
systems are as follows:
   1. Git is the most popular VCS, and the defect and change data can be extracted
      relatively easily through the change logs maintained by Git.
   2. A large number of OSS systems are maintained through Git, including Google’s
      Android OS.
Thus, by employing DCRS, we can easily obtain defect and change data for a large number
of OSS repositories that are based on Git VCS.
5.11.2 Motivation
Previous studies have shown that bug or defect data collected from open source proj-
ects may be employed in research areas such as defect prediction (defect proneness)
(Malhotra and Singh 2012). For instance, some commonly traversed topics in defect
prediction include analysis and validation of the effect of a given metric suite (such
as CKJM and QMOOD) on defect-proneness in a system (Aggarwal et  al. 2009); and
evaluating the performance of pre-existing defect proneness models, such as machine
learning methods (Gondra 2008). Unfortunately, however, there exists no mechanism that can
collect the defect data for a Git-based OSS such as Android and provide useful informa-
tion for the above-stated areas.
   Thus, a system is required that can efficiently collect defect data for a Git-based OSS,
which, in turn, may be used in the above-mentioned research areas. Such a system is
expected to perform the following operations. First, obtain the defect logs of the
software's source code and filter them to obtain the defects that were present in a
given version of that software and have been fixed in the subsequent version. Second,
process the filtered defect logs to extract useful defect information, such as the
unique defect identifier and the defect description, if any. Third, associate the defects
with their corresponding source files (Java code files, or simply class files, in the
source code). Fourth, compute the total number of fixed defects for each class, that is,
the number of defects that have been associated with that class. Finally, obtain the
corresponding values of the different metric suites for each class file in the source
code of the previous version of the OSS.
   The DCRS incorporates each and every functionality stated above and, consequently,
generates various reports that contain the collected defect data in a more processed, mean-
ingful, and useful form.
The change logs maintained by Git contain information regarding the modifications that
have been made from time to time in the source code. These modifications could be for
various purposes, such as defect fixing,
refactoring, enhancements, and so on. Each and every change incurred in the source
code, no matter how big or small, is recorded in the change log and thus forms an indi-
vidual change or modification record. An individual change record provides information
such as:
  •   The Timestamp of commit (i.e., recording the changes with Git repository)
  •   Unique change identifier
  •   Unique defect identifier, if the change has fixed any bug(s)
  •   An optional change description
  •   List of the modified source code files, along with the changes in LOC for each
      changed file
Figure 5.21 presents the generation of change logs by the DCRS for the Android applica-
tion of “Mms.”
   Processing of both versions' source code is necessary because a change log contains
change information from the beginning of time (i.e., from when the software was first
released), whereas we are interested only in the changes incurred during the transition
from the previous version to the next one (e.g., from Android v4.0 to v4.2 in our
demonstration of the DCRS on the Android OS).
   Figure  5.22 depicts a change log record for an Android application package named
“Mms”:
   In the next operation, these change logs are further processed, one record at a time,
to obtain defect records (i.e., changes that were made to fix defects, not for other
reasons such as refactoring or enhancement). Defect IDs and the defect description, if
any, are retrieved from the defect logs. Thus, a defect record differs from a change
record in only one aspect: a defect record has at least one defect identifier. In other
words, a defect log must contain the identifier(s) of the defect(s) that was/were fixed
in the corresponding change. A description of the defect(s) may or may not be provided;
in the latter case, a predefined value of the description is stored for such defect
records.
   These defect IDs are finally mapped to classes in the source code. Only Java source code
or class files are considered and other file formats are ignored. The defect data collected is
thus used to accordingly generate various reports in .csv format, which are described later
in this chapter.
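As a rough illustration of this processing (a sketch of ours, not the actual DCRS implementation), the snippet below scans one change record, treats a "Bug:" line as a defect identifier (as in the commit record of Figure 5.15), and associates that identifier with the Java source files changed in the record. The change-record text is hard-coded here for brevity; in practice it would be read from the change log generated by Git.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefectRecordSketch {
    // Matches defect identifiers of the form "Bug: 9767739".
    private static final Pattern BUG_ID = Pattern.compile("Bug:\\s*(\\d+)");
    // Matches changed-file lines of the form " path/File.java | 77 ++++----".
    private static final Pattern JAVA_FILE = Pattern.compile("^\\s*(\\S+\\.java)\\s*\\|");

    public static void main(String[] args) {
        // Hypothetical change record, shaped like the records in Figures 5.15 and 5.22.
        String record =
              "commit 1b0331d1901d63ac65efb200b0e19d7aa4eb2b8b\n"
            + "Bug: 9767739\n"
            + " AndroidManifest.xml                           | 1 +\n"
            + " src/com/android/mms/ui/ClassZeroActivity.java | 77 +++++----\n";

        List<String> bugIds = new ArrayList<>();
        List<String> changedJavaFiles = new ArrayList<>();
        for (String line : record.split("\n")) {
            Matcher bug = BUG_ID.matcher(line);
            if (bug.find()) {
                bugIds.add(bug.group(1));            // defect identifier(s)
            }
            Matcher file = JAVA_FILE.matcher(line);
            if (file.find()) {
                changedJavaFiles.add(file.group(1)); // only .java files are kept
            }
        }

        // A record with at least one defect identifier is a defect record;
        // each listed Java file is then credited with these fixed defects.
        if (!bugIds.isEmpty()) {
            for (String f : changedJavaFiles) {
                System.out.println(f + " -> fixed defect(s) " + bugIds);
            }
        }
    }
}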
   Figure  5.23 presents the screen that generates various defect reports for the Android
“Mms” Application.
   The change logs required for the above process can only be obtained by using
appropriate Git Bash commands. Hence, Git Bash is a dependency of the DCRS; if it is not
installed and configured correctly on the system, the DCRS will not work at all.
   The entire DCRS system has been implemented in Java programming language
(Java SE 1.7), using Eclipse RCP-Juno IDE. The data for required metrics for the class
files of previous version of an OSS has been obtained using CKJM tool that covers
a wide range of metrics (McCabe 1976; Chidamber and Kemerer 1994; Henderson
1996; Martin 2002). The procedure for defect collection and reporting is presented in
Figure 5.24.
FIGURE 5.21
Generation of change logs by DCRS.
TIMESTAMP 1386010429
Change-Id: Id7a2faa6997036507acad38f43fe17bf1f6a42cd
 AndroidManifest.xml                           | 1 +
 src/com/android/mms/ui/ClassZeroActivity.java | 77 +++++++++++++++++++++------
FIGURE 5.22
Change record from the change log file of Android “Mms” application.
  As we are interested only in the Java source code or class files, we have analyzed a few
of the available Android application packages to determine the fraction of Java source
code files in each of these packages. It was noted that there is a significantly smaller
number of Java source code files compared to other file types, such as layout files, media
files, and string value files, in every application package that was analyzed.
  The Android application packages were “cloned” (downloaded) by the DCRS itself. This
functionality is discussed in detail in Section 5.8.
FIGURE 5.23
Generation of defect reports by DCRS.
Figure 5.25 presents the partial defect details report for Android “Mms” application.
[Figure 5.24 (below) outlines the process flow: the change logs are processed to obtain defect logs, LOC changes are retrieved class-wise for every change, metrics and total LOC changes are obtained, and the defect reports are generated.]
FIGURE 5.24
Flowchart for defect collection and reporting process.
FIGURE 5.25
Example of defect details report records.
To summarize, the fields that are contained in this report are as follows:
Figure 5.26 presents the partial defect count report for Android “Mms” application.
FIGURE 5.26
Example of defect count report records.
Figure 5.27 presents the partial LOC changes report for Android “Mms” application.
  In addition to these, the following auxiliary reports are also generated by the DCRS,
which might be useful for a statistical comparison of the two versions of Android OS
application we have considered:
FIGURE 5.27
Example of LOC changes report records.
To summarize, the fields that are contained in this report are as follows:
Figure 5.28 presents the newly added source files report for Android “Mms” application.
FIGURE 5.28
Example of newly added files report records.
FIGURE 5.29
Example of deleted files report records.
Figure 5.29 presents the deleted source files report for Android “Mms” application.
FIGURE 5.30
Example of consolidated defect and change report records.
The consolidated defect and change report contains, for each class, the defect data
together with the LOC changes incurred for defect fixing as well as for any other purpose,
such as enhancement, refactoring, and so on. In other words, this report can be considered
as the combination of the bug count report and the LOC changes report (described in
Section 5.11.5.3): the bug data are taken from the bug count report, while the LOC changes
incurred for each class are taken from the LOC changes report.
  Figure  5.30 presents the partial consolidated defect and change report for Android
“Mms” application.
   • Standard deviation
   • Median (or 50th percentile)
   • 25th and 75th percentiles
Figure 5.31 presents the descriptive statistics report for Android “Mms” application.
FIGURE 5.31
Example of descriptive statistics report records.
FIGURE 5.32
Cloning operation of DCRS.
5.11.6.2 Self-Logging
Self-logging, or simply logging, may be defined as the process of automatically recording
events, data, and/or data structures about a tool’s execution to provide an audit trail. The
recorded information can be employed by developers, testers, and support personnel for
identifying software problems, monitoring live systems, and for other purposes such as
auditing and postdeployment debugging. The logging process generally involves the
transfer of recorded data to monitoring applications and/or writing the recorded
information and appropriate messages, if any, to files.
  The tool also provides the user with the functionality to view its operational logs. These
self-logs are stored as text file(s) indicating the different events and/or operations that
occurred during the tool’s execution, along with their timestamps. The entries are ordered
by sequence and, hence, by timestamp. Figure 5.33 presents an example self-log file for
DCRS.
  The self-log file follows a daily rolling append policy, that is, the logs for a given day
are appended to the same file, and a new file is created after every 24 hours. The previous
day’s file is stored with a name that indicates the time of its creation. The Java libraries
Logback and SLF4J have been employed to implement self-logging in the DCRS. They can
be downloaded from http://www.slf4j.org/download.html.
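The daily rolling append policy can be illustrated with a small sketch. The DCRS itself uses the Java libraries Logback and SLF4J; the snippet below is only a Python analogue of the same rolling idea, not the tool's actual configuration, and the file name, logger name, and log message are hypothetical:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Daily rolling append policy: entries for a given day go to one file, which is
# rolled over at midnight, renamed with a date suffix, and kept for 30 days.
handler = TimedRotatingFileHandler("dcrs-self.log", when="midnight", backupCount=30)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s"))

logger = logging.getLogger("dcrs")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Cloning of repository started")  # example self-log entry with a timestamp
```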
FIGURE 5.33
Example self-log file of DCRS.
  • Defect prediction and related research work or studies, including analysis and
    validation of the effect of a given metric suite on defect proneness and the eval-
    uation and comparison of various techniques in developing defect-proneness
    models, such as statistical and machine learning methods.
  • Statistical comparison of two given versions of an OSS (which is based on Git
    VCS), in terms of the source files that have been added in the newer version, the
    source files that were present in the previous version but have been deleted in the
    newer version, and the defects that have been fixed by the newer version.
Exercises
  5.1 Briefly describe the importance of mining software repositories.
  5.2 How will you integrate repositories and bug tracking systems?
  5.3 What are VCS? Compare and contrast different VCS.
Further Readings
An in-depth description of the “Git” VCS may be obtained from:
  S. Chacon and B. Straub, Pro Git, 2nd edition, Apress, 2014, https://git-scm.com/book.
The documentation (user guide, mailing lists, etc.) of the “CVS” client—“TortoiseCVS”—
covers the basics and working of the CVS:
https://www.tortoisecvs.org/support.shtml.
Malhotra and Agrawal present a unique defect and change data-collection mechanism by
mining CVS repositories:
  R. Malhotra, and A. Agrawal, “CMS tool: Calculating defect and change data from
     software project repositories,” ACM Software Engineering Notes, vol. 39, no. 1, pp.
     1–5, 2014.
The following book documents and describes the details of the Apache Subversion™ VCS:
A detailed analysis software development history for change propagation in the source
code has been carried out by Hassan and Holt:
  A.E. Hassan, and R.C. Holt, “Predicting change propagation in software systems,”
    Proceedings of the 20th IEEE International Conference on Software Maintenance, IEEE
    Computer Society Press, Los Alamitos, CA, pp. 284–293, 2004.
Ohira et al. present a case study of FLOSS projects at SourceForge for supporting cross-
project knowledge collaboration:
  http://en.wikipedia.org/wiki/Comparison_of_revision_control_software.
  Version Control System Comparison. http://better-scm.shlomifish.org/comparison/
     comparison.html.
  D.J. Worth, and C. Greenough, “Comparison of CVS and Subversion,” RAL-TR-2006-001.
The research data can be analyzed using various statistical measures, and conclusions can
be inferred from these measures. Figure 6.1 presents the steps involved in analyzing and
interpreting the research data. The research data should be reduced to a suitable form
before it can be used for further analysis. Statistical techniques can be used to preprocess
the attributes (software metrics) so that they can be analyzed and meaningful conclusions
can be drawn from them. After preprocessing of the data, the attributes need to be reduced
so that dimensionality is decreased and better results can be obtained. Then, the model is
predicted and validated using statistical and/or machine learning techniques. The results
obtained are analyzed and interpreted from every aspect. Finally, hypotheses are tested
and a decision about the accuracy of the model is made.
  This chapter provides a description of data preprocessing techniques, feature reduction
methods, and tests for statistical testing. As discussed in Chapter 4, hypothesis testing can be
done either without model prediction or can be used for model comparison after the models
have been developed. In this chapter, we present the various statistical tests that can be applied
for testing a given hypothesis. The techniques for model development, methods for model vali-
dation, and ways of interpreting the results are presented in Chapter 7. We explain these tests
with software engineering-related examples so that the reader gets an idea about the practical
use of the statistical tests. The examples of model comparison tests are given in Chapter 7.
6.1.1.1 Mean
The mean is computed by taking the average of the values in the data set. It is defined as
the ratio of the sum of the values of the data points to the total number of data points and
is given as:
FIGURE 6.1
Steps for analyzing and interpreting data.
                              Mean (µ) = (x1 + x2 + ⋯ + xN)/N = (∑ xi)/N
where:
 xi (i = 1, . . . N) are the data points
 N is the number of data points
For example, consider 28, 29, 30, 14, and 67 as values of data points.
  The mean is (28 + 29 + 30 + 14 + 67)/5 = 33.6.
6.1.1.2 Median
The median is the value that divides the data into two halves: half of the data points lie
below the median and half lie above it. For an odd number of data points, the median is
the central value, and for an even number of data points, it is the mean of the two central
values. Hence, exactly 50% of the data points lie above the median and 50% lie below it.
Consider the following data points:
The median is the 4th value, that is, 10. If an additional data point, 40, is added to the
above distribution, then:
                                     Median = (10 + 15)/2 = 12.5
The median is not very useful if the number of categories in an ordinal scale is small.
In such cases, the mode is the preferred measure of central tendency.
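A minimal sketch of these two measures, using Python's statistics module and the data points from the mean example above (the extra value 40 is added only to illustrate the even-count case):

```python
import statistics

data = [28, 29, 30, 14, 67]
print(statistics.mean(data))           # 33.6
print(statistics.median(data))         # 29, the middle value of 14, 28, 29, 30, 67

# With an even number of points the median is the mean of the two central values.
print(statistics.median(data + [40]))  # (29 + 30) / 2 = 29.5
```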
6.1.1.3 Mode
The mode is the value that has the highest frequency in the distribution. For example, in
Table 6.1, the second category of fault severity has the highest frequency of 50. Hence, 2 is
reported as the mode for Table 6.1.
  Unlike the mean and median, a distribution may have more than one mode. In Table 6.2,
two categories of maintenance effort have the same frequency: very high and medium.
This is known as a bimodal distribution. A short sketch illustrating the mode and a
bimodal case follows Table 6.2.
  The major disadvantage of the mode is that it does not produce useful results when
applied to interval/ratio scales having many distinct values. For example, the following
data points represent the number of failures that occurred per second while testing a given
software, arranged in ascending order:
It can be seen that the data is centered around 60–80 failures. But the mode of the
distribution is 18, since it occurs twice in the distribution whereas the rest of the values
occur only once. Clearly, the mode does not represent the central values in this case.
Hence, either other measures of central tendency should be used or the data should be
organized into suitable class intervals before the mode is computed.
                                 TABLE 6.1
                                 Faults at Severity Levels
                                 Fault Severity         Frequency
                                 0                          23
                                 1                          19
                                 2                          50
                                 3                          17
                              TABLE 6.2
                              Maintenance Effort
                              Maintenance Effort          Frequency
                              Very high                          15
                              High                               10
                              Medium                             15
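The sketch below expands the frequencies of Tables 6.1 and 6.2 into raw observations so that the mode can be computed with Python's statistics module; the expansion itself is only for illustration:

```python
import statistics

# Fault severity levels reconstructed from the frequencies in Table 6.1.
severities = [0] * 23 + [1] * 19 + [2] * 50 + [3] * 17
print(statistics.mode(severities))    # 2, the severity level with the highest frequency

# Bimodal case from Table 6.2: two categories share the highest frequency.
efforts = ["very high"] * 15 + ["high"] * 10 + ["medium"] * 15
print(statistics.multimode(efforts))  # ['very high', 'medium']
```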
                    TABLE 6.3
                    Statistical Measures with Corresponding Relevant Scale Types
                    Measures                            Relevant Scale Type
                    Mean                  Interval and ratio data that are not skewed.
                    Median                Ordinal, interval, and ratio, but not useful for
                                           ordinal scales having few values.
                    Mode                  All scale types, but not useful for scales having
                                           multiple values.
Table 6.3 depicts the relevant scale type of data for each statistical measure.
  Consider the following data set:
The mean, median, and mode are shown in Table 6.4; each measure has a different way of
computing the “average” value. If the data is symmetrical, all three measures (mean,
median, and mode) have the same value. But if the data is skewed, there will be differences
among these measures. Figure 6.2 shows the symmetrical and skewed distributions. The
symmetrical curve is a bell-shaped curve, where the data points are equally distributed
about the center.
  Usually, when the data is skewed, the mean is a misleading measure for determining
central values. For example, if we calculate average lines of code (LOC) of 10  modules
given in Table 6.5, it can be seen that most of the values of the LOC are between 200 and 400,
but one module has 3,000 LOC. In this case, the mean will be 531. Only one value has influ-
enced the mean and caused the distribution to skew to the right. However, the median will
be 265, since the median is based on the midpoint and is not affected by the extreme values
                                    TABLE 6.4
                                    Descriptive Statistics
                                    Measure                         Value
                                    Mean                              29.43
                                    Median                            25
                                    Mode                              23
FIGURE 6.2
Graphs representing skewed and symmetrical distributions: (a) left skewed (mean < median < mode), (b) normal
(no skew; mean, median, and mode coincide), and (c) right skewed (mode < median < mean).
                                   TABLE 6.5
                                   Sample Data of LOC for 10 Modules
                                   Module#             LOC   Module#         LOC
                                   1                   200        6          270
                                   2                   202        7          290
                                   3                   240        8          300
                                   4                   250        9          301
                                   5                   260       10          3,000
in the data distribution. Hence, the median better reflects the average LOC in modules as
compared to the mean and is the best measure when the data is skewed.
FIGURE 6.3
Quartiles.
FIGURE 6.4
Example of quartiles for the LOC data in Table 6.5 (Q1 = 240, median = 265, Q3 = 300).
The IQR is defined as the difference between the upper quartile and the lower quartile and is given as:
                                        IQR = Q3 − Q1
For example, for Table 6.5, the quartiles are shown in Figure 6.4:
                              IQR = Q3 − Q1 = 300 − 240 = 60
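As a sketch, the quartiles and IQR for the LOC data of Table 6.5 can be checked with NumPy; note that numpy.percentile uses linear interpolation by default, which gives slightly different quartiles than the median-of-halves convention followed in the text:

```python
import numpy as np

loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]  # LOC values from Table 6.5

q1, q3 = np.percentile(loc, [25, 75])  # 242.5 and 297.5 with linear interpolation
print(q1, q3, q3 - q1)                 # IQR = 55.0 under this convention

# Taking the medians of the lower and upper halves instead, as in the text,
# gives Q1 = 240, Q3 = 300, and IQR = 60 for the same data.
```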
	
The standard deviation measures the average distance of a data point from the mean. It
assesses the spread by calculating the distance of each data point from the mean. The
standard deviation is small if most of the data points are close to the mean and large if
they are spread far from it. The standard deviation (σx) for the population is given as:
                                    σx = √( ∑(x − µ)² / N )
where:
  x is the given value
  N is the number of values
  µ is the mean of all the values
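A small sketch of the population and sample standard deviations, using the data points from the mean example (28, 29, 30, 14, and 67, with mean 33.6):

```python
import statistics

data = [28, 29, 30, 14, 67]
print(round(statistics.pstdev(data), 2))  # 17.69, population standard deviation (divides by N)
print(round(statistics.stdev(data), 2))   # 19.78, sample standard deviation (divides by N - 1)
```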
FIGURE 6.5
Normal curve (68.3% of the values lie within one standard deviation of the mean; for a mean of 250 and a standard
deviation of 50, this is the range 200–300, with 34.15% on either side of the mean).
                          TABLE 6.6
                          Range of Distribution for Normal Data Sets
                          S. No.   Mean        Standard Deviation          Ranges
                          1          250                     50            200–300
                          2          220                     60            160–280
                          3          200                     30            170–230
                          4          200                     10            190–210
TABLE 6.7
Sample Fault Count Data
Fault Count     Data1     35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75, 85, 85, 85, 85, 95, 95, 95,
                           105, 105, 115
                Data2     0, 2, 72, 75, 78, 80, 80, 85, 85, 87, 87, 87, 87, 88, 89, 90, 90, 92, 92, 95, 95, 98, 98,
                           99, 102
                Data3     20, 37, 40, 43, 45, 52, 55, 57, 63, 65, 74, 75, 77, 82, 86, 86, 87, 89, 89, 90, 95, 107,
                           165, 700, 705
FIGURE 6.6
Histogram analysis for fault count data given in Table 6.7: (a) Data1, (b) Data2, and (c) Data3.
For example, suppose that one calculates the average of LOC, where most values are
between 1,000 and 2,000, but the LOC for one module is 15,000. Thus, the data point with
the value 15,000 is located far away from the other values in the data set and is an outlier.
Outlier analysis is carried out to detect the data points that are overinfluential and must be
considered for removal from the data sets.
   The outliers can be divided into three types: univariate, bivariate, and multivariate.
Univariate outliers are influential data points that occur within a single variable. Bivariate
outliers occur when two variables are considered in combination, whereas multivariate
outliers occur when more than two variables are considered in combination. Once the
outliers are detected, the researcher must decide whether to include or exclude each
identified outlier. Outliers generally signal the presence of anomalies, but they may
sometimes reveal interesting patterns to the researchers. The decision is based on the
reason for the occurrence of the outlier.
   Box plots, z-scores, and scatter plots can be used for detecting univariate and bivariate
outliers.
FIGURE 6.7
Example box plot (showing the median, the lower and upper quartiles, and the start and end of the tail).
signify the start and end of the tail. These two boundary lines correspond to ±1.5  IQR.
Thus, once the value of IQR is known, it is multiplied by 1.5. The values shown inside of
the box plots are known to be within the boundaries, and hence are not considered to be
extreme. The data points beyond the start and end of the boundaries or tail are considered
to be outliers. The distance between the lower and the upper quartile is often known as
box length.
  The start of the tail is calculated as Q1 − 1.5 × IQR and the end of the tail is calculated
as Q3 + 1.5 × IQR. To avoid negative or unobserved values, these limits are truncated to
the nearest values of the actual data points. Thus, the actual start of the tail is the lowest
value in the variable above (Q1 − 1.5 × IQR), and the actual end of the tail is the highest
value below (Q3 + 1.5 × IQR).
  The box plots also provide information on the skewness of the data. The median lies in
the middle of the box if the data is not skewed. The median lies away from the middle if
the data is left or right skewed. For example, consider the LOC values given below for a
software:
200, 202, 240, 250, 260, 270, 290, 300, 301, 3000
The median of the data set is 265, lower quartile is 240, and upper quartile is 300. The IQR
is 60. The start of the tail is 240 − 1.5 × 60 = 150 and end of the tail is 300 + 1.5 × 60 = 390.
The actual start of the tail is the lowest value above 150,  that is, 200,  and actual end of
the tail is the highest value below 390, that is, 301. Thus, the case number 10 with value
3,000 is above the end of the tail and, hence, is an outlier. The box plot for the given data
set is shown in Figure 6.8 with one outlier 3,000.
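A minimal sketch of this box plot rule for the LOC data, with the quartiles computed as medians of the lower and upper halves to match the values used in the text:

```python
import numpy as np

loc = [200, 202, 240, 250, 260, 270, 290, 300, 301, 3000]

def quartiles(values):
    # Q1 and Q3 as the medians of the lower and upper halves of the sorted data.
    s = sorted(values)
    n = len(s)
    return float(np.median(s[:n // 2])), float(np.median(s[(n + 1) // 2:]))

q1, q3 = quartiles(loc)                               # 240 and 300
iqr = q3 - q1                                         # 60
start, end = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # 150 and 390
outliers = [v for v in loc if v < start or v > end]   # [3000]
print(q1, q3, iqr, start, end, outliers)
```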
   A decision regarding inclusion or exclusion of the outliers must be made by the research-
ers during data analysis considering the following reasons:
Outlier values may also be present because of a combination of data values across more
than one variable. These outliers are called multivariate outliers. The scatter plot is another
visualization method to detect outliers. In a scatter plot, we simply represent all the data
points graphically. The scatter plot allows us to examine more than one metric variable at
a given time.
FIGURE 6.8
Box plot for LOC values (the value 3,000 lies beyond the end of the tail and is marked as an outlier).
6.1.5.2 Z-Score
Z-score is another method to identify outliers and is used to depict the relationship of a
value to its mean, and is given as follows:
                                    z-score = (x − µ)/σ
where:
 x is the score or value
 µ is the mean
 σ is the standard deviation
The z-score gives the information about the value as to whether it is above or below
the mean, and by how many standard deviations. It may be positive or negative. The
z-score values of data samples exceeding the threshold of ±2.5  are considered to be
outliers.
      Example 6.1:
      Consider the data set given in Table 6.7. Calculate univariate outliers for each variable
      using box plots and z-scores.
      Solution:
      The box plots for Data1, Data2, and Data3 are shown in Figure 6.9. The z-scores for
      data sets given in Table 6.7 are shown in Table 6.8.
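For instance, the z-scores of Data1 from Table 6.7 can be reproduced with SciPy (using the sample standard deviation, as Table 6.8 does); the ±2.5 cut-off flags no value in this particular data set:

```python
import numpy as np
from scipy import stats

data1 = np.array([35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75,
                  85, 85, 85, 85, 95, 95, 95, 105, 105, 115])

z = stats.zscore(data1, ddof=1)        # z-scores based on the sample standard deviation
print(np.round(z[:3], 2))              # first score is about -1.96, as in Table 6.8
print(data1[np.abs(z) > 2.5])          # values beyond the +/-2.5 threshold (none here)
```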
To identify multivariate outliers, for each data point, the Mahalanobis Jackknife distance
D measure can be calculated. Mahalanobis Jackknife is a measure of the distance in
multidimensional space of each observation from the multivariate mean center of the
observations (Hair et al. 2006). Each data point is evaluated using chi-square distribution
with 0.001 significance value.
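A rough sketch of this idea is given below. It computes plain (non-jackknifed) squared Mahalanobis distances from the multivariate mean and compares them with the chi-square cut-off at the 0.001 significance level; the two-variable data matrix is purely illustrative:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative observations with two variables each (hypothetical values).
X = np.array([[35, 10], [45, 12], [45, 72], [55, 75], [55, 78],
              [65, 80], [75, 87], [85, 90], [95, 98], [105, 102]], dtype=float)

diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse of the sample covariance matrix
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distance per point

threshold = chi2.ppf(1 - 0.001, df=X.shape[1])      # chi-square cut-off at 0.001 significance
print(np.round(d2, 2))
print(np.where(d2 > threshold)[0])                  # indices flagged as multivariate outliers
```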
FIGURE 6.9
(a)–(c) Box plots for data given in Table 6.7.
         TABLE 6.8
         Z-Score for Data Sets
         Case No.       Data1       Data2        Data3     Z-scoredata1   Z-scoredata2    Z-scoredata3
          1                35          0           20            −1.959     −3.214           −0.585
          2                45          2           37            −1.469     −3.135           −0.488
          3                45         72           40            −0.979     −0.052           −0.404
          4                55         75           43            −0.489     −0.052           −0.387
          5                55         78           45            −0.489      0.145           −0.375
          6                55         80           52            −0.489      0.145           −0.341
          7                65         80           55            −0.489      0.224           −0.330
          8                65         85           57             0          0.224           −0.279
          9                65         85           63             0          0.224           −0.273
         10                65         87           65             0          0.224           −0.262
         11                75         87           74             0          0.264           −0.234
         12                75         87           75             0          0.303           −0.211
         13                75         87           77             0.489      0.343           −0.211
         15                75         89           86             0.489      0.422           −0.194
         16                85         90           86             0.489      0.422           −0.194
    • The size of a class measured in terms of lines of source code ranges from 0 to 2,313.
    • The values of depth of inheritance tree (DIT) and number of children (NOC)
      are low in the system, which shows that inheritance is not much used in all the
         TABLE 6.9
         Descriptive Statistics for Metrics
          Metric      Min.    Max.      Mean    Median    Std. Dev.    Percentile (25%)    Percentile (75%)
         CBO          0           24      8.32       8       6.38              3                14
         LCOM         0          100     68.72      84      36.89           56.5                96
         NOC          0            5      0.21       0        0.7              0                 0
         RFC          0          222     34.38      28       36.2             10              44.5
         WMC          0          100     17.42      12      17.45              8                22
         LOC          0         2313    211.25     108     345.55              8             235.5
         DIT          0            6         1       1       1.26              0               1.5
            TABLE 6.10
            Correlation Analysis Results
            Metric        CBO          LCOM      NOC      RFC       WMC            LOC        DIT
            CBO         1
            LCOM        0.256           1
            NOC        −0.03           −0.028     1
            RFC         0.386           0.334    −0.049   1
            WMC         0.245           0.318     0.035   0.628     1
            LOC         0.572           0.238    −0.039   0.508     0.624          1
            DIT         0.4692          0.256    −0.031   0.654     0.136          0.345       1
    systems; similar results have also been shown by others (Chidamber et al. 1998;
    Cartwright and Shepperd 2000; Briand et al. 2000a).
  • The lack of cohesion in methods (LCOM) measure, which counts the number of method
    pairs in a class with no attribute usage in common, has high values (up to 100) in the
    KC1 data set.
FIGURE 6.10
Attribute reduction procedure (using attribute selection or attribute extraction techniques).
Hence, attribute reduction leads to improved computational efficiency, lower cost, increased
problem understanding, and improved accuracy. Figure 6.11 shows the categories of attri-
bute reduction methods.
FIGURE 6.11
Classification of attribute reduction methods (attribute selection and attribute extraction).
FIGURE 6.12
Procedure of filter method.
FIGURE 6.13
Procedure of wrapper method.
                           P2 = (b21 × X1) + (b22 × X2) + ⋯ + (b2k × Xk)
                           ⋮
                           Pk = (bk1 × X1) + (bk2 × X2) + ⋯ + (bkk × Xk)
All bij’s, called loadings, are worked out in such a way that the extracted P.C.s satisfy the
following two conditions:
The variables with high loadings help identify the dimension the P.C. is capturing, but this
usually requires some degree of interpretation. To identify these variables, and interpret
the P.C.s, the rotated components are used. As the dimensions are independent, orthogo-
nal rotation is used, in which the axes are maintained at 90  degrees. There are various
strategies to perform such a rotation, including quartimax, varimax, and equimax
orthogonal rotation. For a detailed description, refer to Hair et al. (2006) and Kothari (2004).
  The varimax method maximizes the sum of the variances of the squared loadings of the
factor matrix (a table displaying the factor loadings of all variables on each factor). Varimax
rotation is the most frequently used strategy in the literature. An eigenvalue (or latent root) is associated
with each P.C. It refers to the sum of squared values of loadings relating to a dimension.
Eigenvalue indicates the relative importance of each dimension for the particular set of vari-
ables being analyzed. The P.C. with eigenvalue >1 is taken for interpretation (Kothari 2004).
6.2.3 Discussion
It is useful to interpret the results of regression analysis in the light of results obtained from
P.C. analysis. P.C. analysis shows the main dimensions, including independent variables
as the main drivers for predicting the dependent variable. It would also be interesting to
observe the metrics included in dimensions across various replicated studies; this will help
in finding differences across various studies. From such observations, the recommendations
regarding which independent variable appears to be redundant and need not be collected
can be derived, without losing a significant amount of design information (Briand and Wust
2002). P.C. analysis is a widely used method for removing redundant variables in neural
networks.
   The univariate analysis is used for preselecting the metrics with respect to their
significance, whereas CFS is the widely used method for preselecting independent
variables in machine learning methods (Hall 2000). In Hall (2003), the results showed that
CFS chooses fewer attributes, is faster, and is overall a good performer.
In the given example, x represents the attributes of animals, with critical region c = {run,
walk, sit, and so on}; these are the values that will cause the null hypothesis to be rejected.
The test is whether x ≠ fly; if so, the null hypothesis is rejected, otherwise it is accepted.
Hence, if x = fly, the null hypothesis is accepted.
  In real life, a software practitioner may want to prove that decision tree algorithms
are better than the logistic regression (LR) technique. This is known as the assumption of
the researcher. Hence, the null hypothesis can be formulated as “there is no difference
between the performance of the decision tree technique and the LR technique.” The
assumption needs to be evaluated using statistical tests on the basis of data to reach a
conclusion. In empirical research, hypothesis formulation and evaluation are the bottom
line of research.
  This section will highlight the concept of hypothesis testing, and the steps followed in
hypothesis testing.
6.3.1 Introduction
Consider a setup where the researcher is interested in whether some learning technique
“Technique X” performs better than “Technique Y” in predicting the change proneness of
a class. To reach a conclusion, both technique X and technique Y are used to build change
prediction models. These prediction models are then used to predict the change proneness
of a sample data set (for details on training and testing of models refer Chapter 7) and
based on the outcome observed over the sample data set, it is determined which technique
is the better predictor out of the two. However, concluding which technique is better is a
challenging task because of the following issues:
   1. The number of data points in the sample could be very large, making data analysis
      and synthesis difficult.
   2. The researcher might be biased towards one of the techniques and could overlook
      minute differences that have the potential of impacting the final result greatly.
   3. The conclusions drawn can be assumed to happen by chance because of bias in the
      sample data itself.
To neutralize the impact of researcher bias and ensure that all the data points contribute
to the results, it is essential that a standard procedure be adopted for the analysis and
synthesis of sample data. Statistical tests allow the researcher to test the research questions
(hypotheses) in a generalized manner. There are various statistical tests like the student
t-test, chi-squared test, and so on. Each of these tests is applicable to a specific type of data
and allows for comparison in such a way that using the data collected from a small sample,
conclusions can be drawn for the entire population.
  Step 1: Define hypothesis—In the first step, the hypothesis is defined corresponding
     to the outcomes. The statistical tests are used to verify the hypothesis formed in
     the experimental design phase.
FIGURE 6.14
Steps in statistical tests.
FIGURE 6.15
Categories of statistical tests. The figure organizes the tests as follows: for independent samples, the t-test
(parametric) and the Mann–Whitney U test (nonparametric) are used for two samples, and one-way ANOVA
(parametric) and the Kruskal–Wallis test (nonparametric) for more than two samples; for dependent samples,
the paired t-test (parametric) and the Wilcoxon signed-rank test (nonparametric) are used for two samples, and
related measures ANOVA (parametric) and the Friedman test (nonparametric) for more than two samples; the
chi-square test is used for association between variables; and univariate regression analysis for causal
relationships.
for testing the hypothesis for a binary dependent variable. Table 6.11 depicts the summary
of assumptions, data scale, and normality requirement for each statistical test discussed
in this chapter.
TABLE 6.11
Summary of Statistical Tests
Test                              Assumptions                     Data Scale         Normality
One sample t-test       The data should not have any        Interval or ratio.       Required
                         significant outliers.
                        The observations should be
                         independent.
Two sample t-test       Standard deviations of the two      Interval or ratio.       Required
                         populations must be equal.
                        Samples must be independent of      Interval or ratio.
                         each other.
                        The samples are randomly drawn      Interval or ratio.
                         from respective populations.
Paired t-test           Samples must be related with each   Interval or ratio.       Required
                         other.
                        The data should not have any
                         significant outliers.
Chi-squared test        Samples must be independent of      Nominal or ordinal.      Not required
                         each other.
                        The samples are randomly drawn
                         from respective populations.
F-test                  All the observations should be      Interval or ratio.       Required
                         independent.
                        The samples are randomly drawn
                         from respective populations and
                         there is no measurement error.
One-way ANOVA           One-way ANOVA should be used        Interval or ratio.       Required
                         when you have three or more
                         independent samples.
                        The data should not have any
                         significant outliers.
                        The data should have homogeneity
                         of variances.
Two-way ANOVA           The data should not have any        Interval or ratio.       Required
                         significant outliers.
                        The data should have homogeneity
                         of variances.
Wilcoxon signed test    The data should consist of two      Ordinal or continuous.   Not required
                         “related groups” or “matched
                         pairs.”
Wilcoxon–Mann–          The samples must be independent.    Ordinal or continuous.   Not required
 Whitney test
Kruskal–Wallis test     The test should validate three or   Ordinal or continuous.   Not required
                         more independent sample
                         distributions.
                        The samples are drawn randomly
                         from respective populations.
Friedman test           The samples should be drawn         Ordinal or continuous.   Not required
                         randomly from respective
                         populations.
Here, the alternative hypothesis specifies that the population mean is strictly “greater than”
the sample mean. The hypothesis below is an example of a two-tailed test:
                                              H0: µ = µ0
                                        Ha: µ < µ0 or µ > µ0
Figure 6.16 shows the probability curve for a two-tailed test with the rejection (or critical)
region on both sides of the curve. Thus, the null hypothesis is rejected if the sample mean
lies in either of the rejection regions. The two-tailed test is also called a nondirectional test.
  Figure 6.17 shows the probability curve for a one-tailed test with the rejection region on
one side of the curve. The one-tailed test is also referred to as a directional test.
FIGURE 6.16
Probability curve for two-tailed test.
FIGURE 6.17
Probability curve for one-tailed test.
               TABLE 6.12
               Types of Errors
                                        H0 True                   H0 False
               Reject H0                Type I error (α)          Correct decision
               Do not reject H0         Correct decision          Type II error (β)
level of a test. Type II error is defined as the probability of wrongly not rejecting the null
hypothesis when the null hypothesis is false. In other words, a type II error occurs when
the null hypothesis is actually false, but somehow, it fails to get rejected. It is also known
as “false negative”; a result when an actual “miss” is erroneously seen as a “hit.” The rate
of the type II error is denoted by the Greek letter beta (β) and related to the power of a test
(which equals 1 − β). The definitions of these errors can also be tabularized as shown in
Table 6.12.
6.4.6 t-Test
W. S. Gosset designed the Student’s t-test (Student 1908). The purpose of the t-test is to
determine whether two data sets are different from each other or not. It is based on the
assumption that both the data sets are normally distributed. There are three variants of
t-tests:
   1. One sample t-test, which is used to compare mean with a given value.
   2. Independent sample t-test, which is used to compare means of two independent
      samples.
   3. Paired t-test, which is used to compare means of two dependent samples.
and is compared with a given value of interest. The aim of the one sample t-test is to find
whether there is sufficient evidence to conclude that the mean of a given sample differs
from a specified value. For example, a one sample t-test can be used to determine whether
the average increase in the number of comment lines per method is more than five after
improving the readability of the source code.
  The assumption in the one sample t-test is that the population from which the sample is
derived must have normal distribution. The following null and alternative hypotheses are
formed for applying one sample t-test on a given problem:
                                        t = (µ − µ0)/(σ/√n)
where:
 µ represents mean of a given sample
 σ represents standard deviation
 n represents sample size
The above hypothesis is based on a two-tailed t-test. The degrees of freedom (DOF) are
n − 1, as the t-test is based on the assumption that the standard deviation of the population
is equal to the standard deviation of the sample. The next step is to obtain the significance
value (p-value) and compare it with the established threshold value (α). To obtain the
p-value for the given t-statistic, the t-distribution table needs to be referred to. The table
can only be used given the DOF.
      Example 6.2:
      Consider Table 6.13, where the number of modules for 15 software systems is shown.
      We want to determine whether the population from which the sample is derived has,
      on average, a number of modules different from 12.
              TABLE 6.13
              Number of Modules
               System      No. of Modules     System      No. of Modules     System      No. of Modules
              S1                  10     S6             35      S11        24
              S2                  15     S7             26      S12        23
              S3                  24     S8             29      S13        14
              S4                  29     S9             19      S14        12
              S5                  16     S10            18      S15         5
     Solution:
     The following steps are carried out to solve the example:
                         t = (µ − µ0)/(σ/√n) = (19.93 − 12)/(8.17/√15) = 3.76
            The DOF is 14 (15 − 1) in this example.
            To obtain the p-value for a specific t-statistic, we perform the following steps,
            referring to Table 6.14:
            1. For corresponding DOF, named df, identify the row with the desired DOF.
                In this example, the desired DOF is 14.
            2. Now, in the desired row, mark out the t-score values between which the
                computed t-score falls. In this example, the calculated t-statistic is 3.76.
                This t-statistic falls beyond the t-score of 2.977.
            3. Now, move upward to find the corresponding p-value for the selected t-score
                for either one-tail or two-tail significance test. In this example, the signifi-
                cance value for one-tail test would be <0.005, and for two-tail test it would
                be <0.01.
            Given 14 DOF and referring the t-distribution table, the obtained p-value is 0.002.
                          TABLE 6.14
                          Critical Values of t-Distributions
                          Level of significance for one-tailed test
                                  0.10       0.05       0.02        0.01     0.005
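The same result can be reproduced in a few lines; for example, SciPy's one sample t-test (two-sided by default) on the module counts of Table 6.13 gives essentially the values derived above:

```python
from scipy import stats

modules = [10, 15, 24, 29, 16, 35, 26, 29, 19, 18, 24, 23, 14, 12, 5]  # Table 6.13

t_stat, p_value = stats.ttest_1samp(modules, popmean=12)  # test against a mean of 12 modules
print(round(t_stat, 2), round(p_value, 3))                # roughly t = 3.76, p = 0.002
```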
  H0: µ1 = µ2 (There is no difference in the mean values of both the samples.)
  Ha: µ1 ≠ µ2 (There is difference in the mean values of both the samples.)
                                  t = (µ1 − µ2)/√(σ1²/n1 + σ2²/n2)
where:
  µ1 and µ2 are the means of both the samples, respectively
  σ1 and σ2 are the standard deviations of both the samples, respectively
The DOF is n1 + n2 − 2, where n1 and n2 are the sample sizes of the two samples. Now,
obtain the significance value (p-value) for the computed t-statistic using the t-distribution
and compare it with the established threshold value (α).
      Example 6.3:
      Consider an example comparing the properties of an industrial and an open source
      software system in terms of the average amount of coupling between modules (the other
      modules to which a module is coupled). Both software systems are text editors developed
      in the Java language. In this example, we believe that the type of software affects the
      amount of coupling between modules.
         Industrial: 150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141
         Open source: 138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132
      Solution:
         Step 1: Formation of hypothesis.
            In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
            eses for the example are given below:
                TABLE 6.15
                Descriptive Statistics
                Descriptive Statistic       Industrial Software             Open Source Software
                No. of observations                       12                        12
                Mean                                     162.33                    143.92
                Standard deviation                        20.01                     21.99
             t = (µ1 − µ2)/√(σ1²/n1 + σ2²/n2) = (162.33 − 143.92)/√(20.01²/12 + 21.99²/12) = 2.146
            The DOF is 22 (12 + 12 − 2) in this example. Given 22 DOF and referring the
            t-distribution table, the obtained p-value is 0.043.
         Step 4: Define significance level.
            As computed in Step 3, the p-value is 0.043. It can be seen that the results are
            statistically significant at 0.05 significance value.
         Step 5: Derive conclusions.
            The results are significant at 0.05 significance level. Hence, we reject the null
            hypothesis, and the results show that the mean amount of coupling between
            modules depicted by the industrial software is statistically significantly differ-
            ent than the mean amount of coupling between modules depicted by the open
            source software (t = 2.146, p = 0.043).
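For reference, the same comparison can be run with SciPy's independent two sample t-test (equal variances assumed, matching the pooled calculation above):

```python
from scipy import stats

industrial  = [150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141]
open_source = [138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132]

t_stat, p_value = stats.ttest_ind(industrial, open_source)  # pooled-variance t-test
print(round(t_stat, 3), round(p_value, 3))                  # roughly t = 2.146, p = 0.043
```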
  H0: µ1 − µ2 = 0 (There is no difference between the mean values of the two samples.)
  Ha: µ1 − µ2 ≠ 0 (There exists difference between the mean values of the two samples.)
                                σd = √[(∑d² − (∑d)²/n)/(n − 1)]
where:
 n represents number of pairs and not total number of samples
 d is difference between values of two samples
The DOF is n − 1. The p-value is obtained and compared with the established threshold
value (α) for the computed t-statistic using the t-distribution.
      Example 6.4:
      Consider an example where the values of the CBO metric (the number of other classes to
      which a class is coupled) are given before and after applying a refactoring technique to
      improve the quality of the source code. The data is given in Table 6.16.
      Solution:
         Step 1: Formation of hypothesis.
            In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
            eses for the example are given below:
                 H0: µCBO1 = µCBO2 (Mean of CBO metric before and after applying refactoring
                     are equal.)
                 Ha: µCBO1 ≠ µCBO2 (Mean of CBO metric before and after applying refactoring
                     are not equal.)
         Step 2: Select the appropriate statistical test.
            The samples are extracted from populations with normal distribution. As
            we are using samples derived from the same populations and analyzing the
            before and after effect of refactoring on CBO, these are related samples. We
            need to use paired t-test for comparing the difference between values of CBO
            derived from two dependent samples.
         Step 3: Apply test and calculate p-value.
            We first calculate the mean values of both the samples and also calculate the
            difference (d) among the paired values of both the samples as shown in Table 6.17.
            The t-statistic is given below:
                     σd = √[(∑d² − (∑d)²/n)/(n − 1)] = √[(12 − 8²/15)/14] = 0.743
                     t = (µ1 − µ2)/(σd/√n) = (67.6 − 67.07)/(0.743/√15) = 2.779
             The DOF is 14 (15  −  1) in this example. Given 14  DOF and referring the
             t-distribution table, the obtained p-value is 0.015.
TABLE 6.16
CBO Values
CBO before refactoring   45     48    49        52   56     58     66   67   74   75      81   82   83   88   90
CBO after refactoring    43     47    49        52   56     57     66   67   74   73      80   82   83   87   90
                  TABLE 6.17
                  CBO Values
                  CBO before Refactoring      CBO after Refactoring     Differences (d)
                  45                             43                            2
                  48                             47                            1
                  49                             49                            0
                  52                             52                            0
                  56                             56                            0
                  58                             57                            1
                  66                             66                            0
                  67                             67                            0
                  74                             74                            0
                  75                             73                            2
                  81                             80                            1
                  82                             82                            0
                  83                             83                            0
                  88                             87                            1
                  90                             90                            0
                  µCBO1 = 67.6                   µCBO2 = 67.07
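The paired comparison of Table 6.17 can likewise be checked with SciPy's dependent samples t-test:

```python
from scipy import stats

before = [45, 48, 49, 52, 56, 58, 66, 67, 74, 75, 81, 82, 83, 88, 90]  # CBO before refactoring
after  = [43, 47, 49, 52, 56, 57, 66, 67, 74, 73, 80, 82, 83, 87, 90]  # CBO after refactoring

t_stat, p_value = stats.ttest_rel(before, after)  # paired t-test on the 15 matched classes
print(round(t_stat, 2), round(p_value, 3))        # roughly t = 2.78, p = 0.015
```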
where:
 Oij	is the observed frequency of the cell in the ith row and jth column
 Eij is the expected frequency of the cell in the ith row and jth column
                                 Erow,column = (Nrow × Ncolumn)/N
where:
  N is the total number of observations
  Nrow is the total of all observations in a specific row
  Ncolumn is the total of all observations in a specific column
  Erow,column is the expected frequency of the cell in the given row and column
The larger the difference between the observed and the expected values, the greater is the
deviation from the stated null hypothesis. The DOF is (row − 1) × (column − 1) for any given
table. The expected values are calculated for each category of the categorical variable at each
factor of the other categorical variable. Then, the χ² value is calculated for each cell. After
calculating the individual χ² values, the individual χ² values of each cell are added to obtain
an overall χ² value. The overall χ² value is compared with the tabulated value for
(row − 1) × (column − 1) DOF. If the calculated χ² value is greater than the tabulated χ² value
at critical value α, we reject the null hypothesis.
      Example 6.5:
      Consider Table 6.18 that consists of data for a particular software. It states the catego-
      rization of modules according to three maintenance levels (high, medium, and low)
      and according to the number of LOC (high and low). A researcher wants to investigate
      whether LOC and maintenance level are independent of each other or not.
                        TABLE 6.18
                        Categorization of Modules
                                              Maintenance Level
                                          High        Low      Medium   Total
                        LOC      High       23         40         22     85
                                 Low        17         30         20     67
                        Total               40         70         42    152
Data Analysis and Statistical Testing                                                                            237
                                                          N row × N column
                                         Erow,column =
     	                                                           N
                                                          ( Oij − Eij )
                                                                          2
     	
                                              2
                                             χ =     ∑         Eij
                                                                                                    2
            Finally, calculate the overall χ value by adding all corresponding χ values of
                                            2
each cell.
           TABLE 6.19
           Calculation of Expected Frequency
                                                               Maintenance Level
                                             High                             Low                Medium
           LOC          High           85 × 40                       85 × 70                  85 × 42
                                               = 22.36                       = 39.14                  = 23.48
                                        152                           152                      152
                        Low             67 × 40                      67 × 70                   67 × 42
                                                = 17.63                      = 30.85                   = 18.52
                                         152                          152                       152
             TABLE 6.20
             Calculation of χ2 Values
                                                      Maintenance Level
                                     High                            Low                    Medium
                                         2
             LOC     High     (23 − 22.36)                (40 − 39.14)    2
                                                                                       (22 − 23.48)2
                                           = 0.017                       = 0.018                     = 0.093
                                    23                        39.14                        23.48
                                                                       2
                     Low      (17 − 17.63)2               (30 − 30.85)                 (20 − 18.52)2
                                            = 0.022                      = 0.023                     = 0.118
                                  17.63                       30.85                        18.52
238                                                               Empirical Research in Software Engineering
      Example 6.6
      Analyze the performance of four algorithms when applied on a single data set as given
      in Table 6.21. Evaluate whether there is any significant difference in the performance of
      the four algorithms at 5% significance level.
      Solution:
         Step 1: Formation of hypothesis.
            The hypotheses for the example are given below:
                H0: There is no significant difference in the performance of the algorithms.
                Ha: There is significant difference in the performance of the algorithms.
         Step 2: Select the appropriate statistical test.
            To explore the “goodness-of-fit” of different algorithms when applied on a
            specific data set, we can effectively use chi-square test.
         Step 3: Apply test and calculate p-value.
            Calculate the expected frequency of each cell according to the following
            formula:
                                                     ∑
                                                           n
                                                                  Oi
                                                  E=       i =1
                                                          n
            where:
               Oi is the observed value of ith observation
               n is the total number of observations
                                                81 + 61 + 92 + 43
                                           E=                     = 69.25
                                                        4
                                           2
            Next, we calculate individual χ values as shown in Table 6.22.
                                      TABLE 6.21
                                      Performance Values of Algorithms
                                      Algorithm                        Performance
                                      A1                                    81
                                      A2                                    61
                                      A3                                    92
                                      A4                                    43
             TABLE 6.22
             Calculation of χ Values
                             2
                          Observed           Expected
                          Frequency         Frequency
                                                                                                    (Oij − Eij )
                                                                                                                   2
Now
                                                    (Oij − Eij )
                                                                      2
         	
                                         χ2 =   ∑        Eij
                                                                          = 20.393
            The DOF would be n  −  1,  that is, (4  −  1)  =  3. Given 3  DOF and referring
                  2                             2
            the χ -distribution table, we get χ value as 7.815 at α = 0.05, and the obtained
            p-value is 0.0001.
         Step 4: Define significance level.
            It can be seen that the results are statistically significant at 0.05  significance
            value as the obtained p-value in Step 3 is less than 0.05.
         Step 5: Derive conclusions.
            The results are significant at 0.05 significance level. Hence, we reject the null
            hypothesis, and the results show that there is significant difference in the per-
                                             2
            formance of four algorithms ( χ  = 20.393, p = 0.0001).
     Example 6.7:
     Consider a scenario where a researcher wants to find the importance of SLOC metric,
     in deciding whether a particular class having more than 50 source LOC (SLOC) will
     be defective or not. The details of defective and not defective classes are provided in
     Table 6.23. Test the result at 0.05 significance value.
     Solution:
        Step 1: Formation of hypothesis.
           The null and alternate hypotheses are formed as follows:
               H0: Classes having more than 50  SLOC will not be defective.
               Ha: Classes having more than 50 SLOC will be defective.
        Step 2: Select the appropriate statistical test.
           To investigate the importance of SLOC attribute in detection of defective and
           not defective classes, we can appropriately use chi-square test to find an attri-
           bute’s importance.
        Step 3: Apply test and calculate p-value.
           Calculate the expected frequency of each cell according to the following formula:
                                                        N row × N column
                                       Erow,column =
     	                                                         N
               Table 6.24 shows the observed and the calculated expected frequency of each
                                                            2
               cell. We also then calculate the individual χ value of each cell.
               Now
                                                   (Oij − Eij )
                                                                  2
     	
                                        χ2 =   ∑        Eij
                                                                      = 716.66
             TABLE 6.23
             SLOC Values for Defective and Not Defective Classes
                                                      Defective (D)             Not Defective (ND)   Total
             Number of classes having SLOC ≥ 50               200                      200             400
             Number of classes having SLOC < 50               100                      700             800
             Total                                            300                      900           1,200
240                                                    Empirical Research in Software Engineering
         TABLE 6.24
         Calculation of Expected Frequency
                                                                                                 (Oij − Eij )
                                                                                                                2
         Observed Frequency       Expected Frequency
                                                         (Oij − Eij )   (Oij − Eij )
                                                                                       2
                  Oij                      Eij                                                        Eij
         200                        400 × 300                 100          10,000                      100
                                              = 100
                                      1200
         200                        400 × 900                −100          10,000                    33.33
                                              = 300
                                      1200
         100                        800 × 300                −100          10,000                       50
                                              = 200
                                      1200
         700                        400 × 900                 400        160,000                   533.33
                                              = 300
                                      1200
      Example 6.8:
      Consider a scenario where 40 students had developed the same program. The size of the pro-
      gram is measured in terms of LOC and is provided in Table 6.25. Evaluate whether the size
      values of the program developed by 40 students individually follows normal distribution.
      Solution:
         Step 1: Formation of hypothesis.
            The null and alternative hypotheses are as follows:
                H0: The data follows a normal distribution.
                Ha: The data does not follow a normal distribution.
         Step 2: Select the appropriate statistical test.
            In the case of the normal distribution, there are two parameters, the mean (µ)
            and the standard deviation (σ) that can be estimated from the data. Based on
            the data, µ = 793.125 and σ = 64.81. To test the normality of data, we can use
            chi-square test.
         Step 3: Apply test and calculate p-value.
            We first need to divide data into segments in such a way that the segments
            have the same probability of including a value, if the data actually is normally
           TABLE 6.25
           LOC Values
           641       672    811      770         741   854      891      792               753        876
           801       851    744      948         777   808      758      773               734        810
           833       704    846      800         799   724      821      757               865        813
           721       710    749      932         815   784      812      837               843        755
Data Analysis and Statistical Testing                                                             241
           distributed with mean µ and standard deviation σ. We divide the data into
           10 segments. We find the upper and lower limits of all the segments. To find
           upper limit (xi) of ith segment, the following equation is used:
                                                                 i
                                               P ( X < xi ) =
                                                                10
           where:
              i = 1–9
              X is N(µ, σ2)
           where:
              i = 1–9
              Xs is N(0,1)
                                                             xi −µ
                                                      zi =
                                                               σ
           Using standard normal table, we can calculate the values of zi. We can then
           calculate the value of xi using the following equation:
                                                xi = σzi + µ
           The calculated values zi and xi are given in Table  6.26. Since, a normally
           distributed variable theoretically ranges from  −∞ to +∞, the lower limit of
           segment 1 is taken as –∞ and the upper limit of segment 10 is taken as +∞. The
           number of values that fall in each segment are also shown in the table. They
           represent the observed frequency (Oi). The expected number of values (Ei) in
           each segment can be calculated as 40/10 = 4.
           Now,
                                                    (Oij − Eij )
                                                                   2
     	
                                        χ2 =   ∑         Eij
                                                                       =5
                  TABLE 6.26
                  Segments and χ2 Calculation
                  Segment                 Lower          Upper
                  No.           zi        Limit          Limit              Oi   Ei   (O i−Ei)2
                  1            −1.28      −∞             710.17             4    4       0
                  2            −0.84      710.17         738.68             3    4       1
                  3            −0.525     738.68         759.10             7    4       9
                  4            −0.255     759.10         776.60             2    4       4
                  5             0         776.60         793.13             3    4       1
                  6             0.255     793.13         809.65             4    4       0
                  7             0.525     809.65         827.15             6    4       4
                  8             0.84      827.15         847.56             4    4       0
                  9             1.28      847.56         876.08             4    4       0
                  10              –       876.08         +∞                 3    4       1
242                                                         Empirical Research in Software Engineering
6.4.8 F-Test
F-test is used to investigate the equality of variance for two populations. A number
of assumptions need to be checked for application of F-test, which includes the follow-
ing (Kothari 2004):
We can formulate the following null and alternative hypotheses for the application of
F-test on a given problem with two populations:
                                                  ( σsample1 )
                                                              2
                                            F=
                                                  ( σsample2 )
                                                               2
                                                   ∑
                                                        n
                                                             ( xi − µ )
                                                                          2
                                      σsample =         i =1
                                                            n−1
where:
 n represents the number of observations in a sample
 xi represents the ith observation of the sample
 µ represents the mean of the sample observations
We also designate v1 as the DOF in the sample having greater variance and v2 as the DOF in the
other sample. The DOF is designated as one less than the number of observations in the cor-
responding sample. For example, if there are 5 observations in a sample, then the DOF is des-
ignated as 4 (5 − 1). The calculated value of F is compared with tabulated Fα (v1, v2) value at the
desired α value. If the calculated F-value is greater than Fα, we reject the null hypothesis (H0).
Data Analysis and Statistical Testing                                                                                      243
                  TABLE 6.27
                  Runtime Performance of Learning Techniques
                  A1               11              16           10         4         8           13      17     18     5
                  A2               14              17           9          5         7           11      19     21     4
     Example 6.9:
     Consider Table 6.27 that shows the runtime performance (in seconds) of two learning
     techniques (A1 and A2) on several data sets. We want to test whether the populations
     have the same variances.
     Solution:
        Step 1: Formation of hypothesis.
           In this step, null (H0) and alternative (Ha) hypothesis are formed. The hypoth-
           eses for the example are given below:
                H0: σ12 = σ22 (Variances of two populations are equal.)
                Ha: σ12 ≠ σ22 (Variances of two populations are not equal.)
        Step 2: Select the appropriate statistical test.
           The samples belong to normal populations and are independent in nature.
           Thus, to investigate the equality of variances of two populations, we use F-test.
        Step 3: Apply test and calculate p-value.
           In this example, n1 = 9 and n2 = 9. The calculation of two sample variances is as
           follows:
           We first compute the means of the two samples,
µ1 = 11.33 and µ2 = 11.89
                                     ∑
                                            9
                                                   ( xi − µ )
                                                                2
                                                                        (11 − 11.33 )
                                                                                         2
                               2                                                             +  + (5 − 11.33)2
                           σ   1   =        i =1
                                                                    =                                           = 26
     	                                          n1 − 1                                       9−1
                                 ∑
                                        9
                                               ( xi − µ )
                                                            2
                                                                     (14 − 11.89 )
                                                                                     2
                           2                                                             +  + (4 − 11.89)2
                       σ   2   =        i =1
                                                                =                                           = 38.36
     	                                      n2 − 1                                       9−1
                                                     σ 22 38.36                   2     2
                                            F=           =      = 1.47 (because σ 2  > σ1 )
     	                                               σ12   26
We also assume that all the other factors except the ones that are being investigated
are adequately controlled, so that the conclusions can be appropriately drawn. One-
way ANOVA, also called the single factor ANOVA, considers only one factor for analy-
sis  in  the outcome of the dependent variable. It is used for a completely randomized
design.
  In general, we calculate two variance estimates, one “within samples” and the other
“between samples.” Finally, we compute the F-value with these two variance estimates as
follows:
The computed F-value is then compared with the F-limit for specific DOF. If the computed
F-value is greater than the F-limit value, then we can conclude that the sample means
differ significantly.
t-test is sufficient. We can formulate the following null and alternative hypotheses for
application of one-way ANOVA on a given problem:
The steps for computing F-statistic is as follows. Here, we assume k is the number of sam-
ples and n is the number of levels:
  Step a: Calculate the means of each of the samples: µ1, µ2, µ3 … µk.
  Step b: Calculate the mean of sample means.
                                            µ1 +µ 2 +µ 3 ++µ k
                                     µ=
                                          Number of samples (k )
Step c: Calculate the sum of squares of variance between the samples (SSBS).
                 SSBS = n1 ( µ1 − µ ) + n2 ( µ 2 − µ ) + n3 ( µ 3 − µ ) + + nk ( µ k − µ )
                                      2               2               2                       2
  Step d: Calculate the sum of squares of variance within samples (SSWS). To obtain
     SSWS, we find the deviation of each sample observation with their corresponding
     mean and square the obtained deviations. We then sum all the squared deviations
     values to obtain SSWS.
  Step f: Calculate the mean square between samples (MSBS) and mean square
     within samples (MSWS), and setup an ANOVA summary as shown in Table 6.28.
  The calculated value of F is compared with tabulated Fα (k − 1, n − k) value at the
     desired α value. If the calculated F-value is greater than Fα, we reject the null
     hypothesis (H0).
    TABLE 6.28
    Computation of Mean Square and F-Statistic
    Source of Variation     Sum of Squares (SS)       DOF       Mean Square (MS)                  F-Ratio
    Between sample                   SSBS             k − 1              SSBS                          MSBS
                                                                   MSBS=                F −ratio=
                                                                         K −1                          MSWS
    Within sample                   SSWS              n − k              SSWS
                                                                   MSWS=
                                                                          n−k
    Total                            SSTV             n − 1
246                                                                   Empirical Research in Software Engineering
                                    TABLE 6.29
                                    Accuracy Values of Techniques
                                                                    Techniques
                                    Data Sets              A1           A2           A3
                                    D1                   60 (x11)     50 (x12)   40 (x13)
                                    D2                   40 (x21)     50 (x22)   40 (x23)
                                    D3                   70 (x31)     40 (x32)   50 (x33)
                                    D4                   80 (x41)     70 (x42)   30 (x43)
      Example 6.10:
      Consider Table 6.29 that shows the performance values (accuracy) of three techniques
      (A1, A2, and A3), which are applied on four data sets (D1, D2, D3, and D4) each. We want
      to investigate whether the performance of all the techniques calculated in terms of accu-
      racy (refer to Section 7.5.3 for definition of accuracy) are equivalent.
      Solution:
      The following steps are carried out to solve the example.
                    60 + 40 + 70 + 80                50 + 50 + 40 + 70               40 + 40 + 50 + 30
             µ1 =                     = 62.5 ; µ 2 =                   = 52.5 ; µ1 =                   = 40
                            4                                4                               4
             Step b: Calculate the mean of sample means.
                                                      µ1 + µ 2 + µ 3 ...+ µ k
                                           µ=
                                                    Number of samples (k )
      	
                                                     62.5 + 52.5 + 40
                                                µ=                    = 51.67
      	                                                      3
             Step c: Calculate the SSBS.
                       SSBS = n1 ( µ1 − µ ) + n2 ( µ 2 − µ ) + n3 ( µ 3 − µ ) +  + nk ( µ k − µ )
                                            2                   2                2                   2
      	
                    SSBS = 4 ( 62.5 − 51.67 ) + 4 ( 52.5 − 51.67 ) + 4 ( 40 − 51.67 ) = 1016.68
                                                2                       2                   2
      	
Data Analysis and Statistical Testing                                                                                 247
            	
                      SSWS = ( 60 − 62.5 ) +  + ( 80 − 62.5 )  + ( 50 − 52.5 ) +  + ( 70 − 52.5 ) 
                                           2                   2                  2                   2
                                                                                                     
                                 + ( 40 − 40 ) +  + ( 30 − 40 )  = 1550
                                               2                 2
 
               Step f: Calculate MSBS and MSWS, and setup an ANOVA summary as shown
                   in Table 6.30.
               The DOF for between sample variance is 2  and that for within sample vari-
               ance is 9. For the corresponding DOF, we compute the F-value using the
               F-distribution table and obtain the p-value as 0.103.
            Step 4: Define significance level.
               After obtaining the p-value in Step 3, we need to decide the threshold or α
               value. The calculated value of F at Step 3 is 2.95, which is less than the tabu-
               lated value of F (4.26) with DOF being v1 = 2 and v2 = 9 at 5% level. Thus, the
               results are not statistically significant at 0.05 significance value.
            Step 5: Derive conclusions.
               As the results are not statistically significant at 0.05  significance value, we
               accept the null hypothesis, which states that there is no difference in sample
               means and all the three techniques perform equally well. The difference in
               observed values of the techniques is only because of sampling fluctuations
               (F = 2.95, p = 0.103).
TABLE 6.30
Computation of Mean Square and F-Statistic
                       Sum of
Source of              Squares                                                                                  F-Limit
Variation                (SS)             DOF           Mean Square (MS)                   F-Ratio               (0.05)
Between sample         1016.68       3 − 1 = 2                1016.68                      508.34             F(2,9) = 4.26
                                                     MSBS =           = 508.34        F=          = 2.95
                                                                 2                         172.22
Within sample          1550          12 − 3 = 9                1550
                                                     MSWS =         = 172.22
                                                                 9
Total                  2566.68       11
248                                                    Empirical Research in Software Engineering
To perform the test, we compute the differences among the related pair of values of both
the treatments. The differences are then ranked based on their absolute values. We perform
the following steps while assigning ranks to the differences:
   1. Exclude the pairs where the absolute difference is 0. Let nr be the reduced number
      of pairs.
   2. Assign rank to the remaining nr pairs based on the absolute difference. The
      smallest absolute difference is assigned a rank 1.
   3. In case of ties among differences (more than one difference having the same
      value), each tied difference is assigned an average of tied ranks. For example,
      if there are two differences of data value 5 each occupying 7th and 8th ranks,
      we would assign the mean rank, that is, 7.5 ([7  +  8]/2  =  7.5) to each of the
      difference.
We now compute two variables R+ and R−. R+ represents the sum of ranks assigned to dif-
ferences, where the data instance in the first treatment outperforms the second treatment.
However, R− represents the sum of ranks assigned to differences, where the second treat-
ment outperforms the first treatment. They can be calculated by the following formula
(Demšar 2006):
                                        R+ =   ∑ rank ( d )
                                               di >0
                                                            i
                                        R− =   ∑ rank ( d )
                                               di <0
                                                            i
where:
 di is the difference between performance measures of first treatment from the second
        treatment when applied on n different data instances
                                          Q − ( 1 4 ) nr ( nr + 1)
                                 Z=
                                       (1 24 ) nr ( nr + 1) ( 2nr + 1)
If the Z-statistic is in the critical region with specific level of significance, then the null
hypothesis is rejected and it is concluded that there is significant difference between two
treatments, otherwise null hypothesis is accepted.
      Example 6.11:
      For example, consider an example where a researcher wants to compare the perfor-
      mance of two techniques (T1 and T2) on multiple data sets using a performance measure
      as given in Table 6.31. Investigate whether the performance of two techniques measured
      in terms of AUC (refer to Section 7.5.6 for details on AUC) differs significantly.
      Solution:
         Step 1: Formation of hypothesis.
            The hypotheses for the example are given below:
                H0: The performance of the two techniques does not differ significantly.
                Ha: The performance of the two techniques differs significantly.
Data Analysis and Statistical Testing                                                                             249
                                   TABLE 6.31
                                   Performance Values of Techniques
                                                                       Techniques
                                   Data Sets                         T1                T2
                                   D1                                0.75             0.65
                                   D2                                0.87             0.73
                                   D3                                0.58             0.64
                                   D4                                0.72             0.72
                                   D5                                0.60             0.70
di < 0
                         Q − ( 1 4 ) nr ( nr + 1)                      3.5 − ( 1 4 ) 4 ( 4 + 1)
                 Z=                                          =                                         = −0.549
                       (1 24 ) nr ( nr + 1) ( 2nr + 1)               (1 24 ) 4 ( 4 + 1) ( 2 × 4 + 1)
     	
            The obtained p-value is 0.581 with Z-distribution table, when DOF is (n − 1),
            that is, 1.
         Step 4: Define significance level.
                                       2
            The chi-square value is χ 0.05 = 3.841. As the test statistic value (Z = −0.549) is
                        2
            less than χ value, we accept the null hypothesis. The obtained p-value in Step
            3 is greater than α = 0.05. Thus, the results are not significant at critical value
            α = 0.05.
                            TABLE 6.32
                            Computing R+ and R−
                            Data Set       T1          T2             di      |di|      Rank(di)
                            D1            0.75        0.65       −0.10        0.10           2.5
                            D2            0.87        0.73       −0.14        0.14           4
                            D3            0.58        0.64        0.06        0.06           1
                            D4            0.72        0.72        0           0              –
                            D5            0.60        0.70        0.10        0.10           2.5
250                                                         Empirical Research in Software Engineering
    1. Arrange the data values of all the observations (both the samples) in ascending
       (low to high) order.
    2. Assign ranks to all the observations. The lowest value observation is provided
       rank 1, the next to lowest observation is provided rank 2,  and so on, with the
       highest observation given the rank N.
    3. In case of ties (more than one observation having the same value), each tied obser-
       vation is assigned an average of tied ranks. For example: if there are three observa-
       tions of data value 20 each occupying 7th, 8th, and 9th ranks, we would assign the
       mean rank, that is, 8 ([7 + 8 + 9]/3 =	8) to each of the observation.
    4. We then find the sum of all the ranks allotted to observations in sample 1  and
       denote it with T1. Similarly, find the sum of all the ranks allotted to observations in
       sample 2 and denote it as T2.
    5. Finally, we compute the U-statistic by the following formula:
                                                         n1 ( n1 + 1)
                                           U = n1.n2 +                − T1
	     	                                                        2
          or
                                                         n2 ( n2 + 1)
                                           U = n1.n2 +                − T2
	     	                                                        2
It can be observed that the sum of the U-values obtained by the above two formulas is
always equal to the product of the two sample sizes (n1.n2; Hooda 2003). It should be noted
Data Analysis and Statistical Testing                                                             251
that we should use the lower computed U-value as obtained by the two equations described
above. Wilcoxon–Mann–Whitney test has two specific cases (Anderson et al. 2002; Hooda
2003): (1) when the sample sizes are small (n1  <  7, n2  <  8) or (2) when the sample sizes
are large (n1  ≥  10, n2  ≥  10). The p-values for the corresponding computed U-values are
interpreted as follows:
                                          n1 . n2        n1 . n2 ( n1 + n2 +1)
                                  µU =            ; σU =
  	                                         2                      12
      Thus, the Z-statistic can be defined as,
                                                     U − µu
                                                Z=
                                                      σu
      Example 6.12:
      Consider an example for comparing the coupling values of two different software (one
      open source and other academic software), to ascertain whether the two samples are
      identical with respect to coupling values (coupling of a module corresponds to the
      number of other modules to which a module is coupled).
      Solution:
         Step 1: Formation of hypothesis.
            In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
            eses for the example are given below:
                H0: µ1 − µ2 = 0 (The two samples are identical in terms of coupling values.)
                Ha: µ1 − µ2 ≠ 0 (The two sample are not identical in terms of coupling values.)
         Step 2: Select the appropriate statistical test.
            The two samples of our study are independent in nature, as they are collected
            from two different software. Also, the outcome variable (amount of coupling)
            is continuous or ordinal in nature. The data may not be normal. Hence, we
252                                                           Empirical Research in Software Engineering
                                 TABLE 6.33
                                 Computation of Rank Statistics for
                                 Coupling Values of Two Software
                                 Observations         Rank          Sample Name
                                 5                       1           Open source
                                 23                      2           Open source
                                 32                      3           Open source
                                 35                      4           Academic
                                 38                      5           Open source
                                 43                      6           Academic
                                 52                      7           Open source
                                 89                      8           Academic
                                 93                      9           Academic
                                                        n2 ( n2 + 1)
                                        U = n1.n2 +                  − T2
                                                              2
                                                      5 ( 5 +1)
                                           = 4. 5 +               − 18 = 17
                                                         2
           We compute the p-value to be 0.056 at α = 0.05 for the values of n1  and n2  as
           4 and 5, respectively, and the U-value as 3.
        Step 4: Define significance level.
           As the derived p-value of 0.056, in Step 3, is greater than 2α = 0.10, we accept
           the null hypothesis at α = 0.05. Thus, the results are not significant at α = 0.05.
        Step 5: Derive conclusions.
           As shown in Step 4, we accept the null hypothesis. Thus, we conclude that
           the coupling values of the academic and open source software do not differ
           significantly (U = 3, p = 0.056).
      Example 6.13:
      Let us consider another example for large sample size, where we want to ascertain
      whether the two sets of observations (sample 1 and sample 2) are extracted from identi-
      cal populations by observing the cohesion values of the two samples.
            Sample 1: 55, 40, 71, 59, 48, 40, 75, 46, 71, 72, 58, 76
            Sample 2: 46, 42, 63, 54, 34, 46, 72, 43, 65, 70, 51, 70
Data Analysis and Statistical Testing                                                          253
     Solution:
        Step 1: Formation of hypothesis.
           In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
           esis for the example is given below:
                H0: µ1 − µ2 = 0 (The two samples are identical in terms of cohesion values.)
                Ha: µ1  −  µ2  ≠  0 (The two sample are not identical in terms of cohesion
                    values.)
        Step 2: Select the appropriate statistical test.
           The two samples of our study are independent in nature as they are collected
           from two different software. Also, the outcome variable (amount of cohesion) is
           continuous or ordinal in nature. The data may not be normal. Hence, we use the
           Wilcoxon–Mann–Whitney test for comparing the differences among cohesion
           values of the two software.
        Step 3: Apply test and calculate p-value.
           In this example, n1 = 12, n2 = 12, and N = 24. Table 6.34 shows the arrangement
           of all the observations in ascending order, and the ranks allocated to them.
                Sum of ranks assigned to observations in sample 1 (T1) = 2.5 + 2.5 + 7 + 9 
                + 12 + 13 + 14 + 19.5 + 19.5 + 21.5 + 23 + 24 = 167.5.
                Sum of ranks assigned to observations in sample 2 (T2) = 1 + 4 + 5 + 7 + 7 
                + 10 + 11 + 15 + 16 + 17.5 + 17.5 + 21.5 = 132.5.
                              TABLE 6.34
                              Computation of Rank Statistics for
                              Cohesion Values of Two Samples
                              Observations    Rank     Sample Name
                              34               1          Sample 2
                              40               2.5        Sample 1
                              40               2.5        Sample 1
                              42               4          Sample 2
                              43               5          Sample 2
                              46               7          Sample 1
                              46               7          Sample 2
                              46               7          Sample 2
                              48               9          Sample 1
                              51               10         Sample 2
                              54               11         Sample 2
                              55               12         Sample 1
                              58               13         Sample 1
                              59               14         Sample 1
                              63               15         Sample 2
                              65               16         Sample 2
                              70               17.5       Sample 2
                              70               17.5       Sample 2
                              71               19.5       Sample 1
                              71               19.5       Sample 1
                              72               21.5       Sample 1
                              72               21.5       Sample 2
                              75               23         Sample 1
                              76               24         Sample 1
254                                                          Empirical Research in Software Engineering
                                                n1 ( n1 + 1)
                                  U = n1.n2 +                − T1
                                                      2
                                                   12 ( 12 +1)
                                     = 12 ⋅ 12 +                 − 167.5 = 54.5
                                                       2
                                                n2 ( n2 + 1)
                                  U = n1.n2 +                − T2
                                                      2
                                                  12 ( 12 + 1)
                                    = 12 ⋅ 12 +                  − 132.5 = 89.5
                                                       2
            As the sample size is large, we can calculate the mean (µU) and standard devia-
            tion (σU) of the normal population as follows:
                                             U − µ u 54.5 − 72
                                        Z=          =          = − 1.012
        	                                     σu      17.32
  H0: µ1 = µ2 = … µk (All samples have identical distributions and belong to the same
    population.)
  Ha: µ1 ≠ µ2 ≠ … µk (All samples do not have identical populations and may belong to
    different populations.)
The steps to compute the Kruskal–Wallis test statistic H are very similar to that of
Wilcoxon–Mann–Whitney test statistic U. Assuming there are k samples of size n1, n2, … nk,
respectively, and the total number of observations N (N = n1 + n2 + … nk), we perform the
following steps:
   1. Organize and sort the data values of all the observations (belonging to all the
      samples) in an ascending (low to high) order.
Data Analysis and Statistical Testing                                                           255
   2. Next, allocate ranks to all the observations from 1 to N. The observation with the
      lowest data value is assigned a rank of 1, and the observation with the highest data
      value is assigned rank N.
   3. In case of two or more observations of equal values, assign the average of the
      ranks that would have been assigned to the observations. For example, if there
      are two observations of data value 40  each occupying 3rd and 4th ranks, we
      would assign the mean rank, that is, 3.5 (  3 + 4  2 = 3.5 ) to each of the 3rd and 4th
      observations.
   4. We then compute the sum of ranks allocated to observations in each sample and
      denote it as T1, T2… Tk.
   5. Finally, the H-statistic is computed by the following formula:
                                                       k
                                                           Ti 2
                                                      ∑
                                             12
                                  H=                            − 3 ( N + 1)
                                         N ( N + 1)        ni
	   	                                        i =1
The calculated H-value is compared with the tabulated χα	 value at (k − 1) DOF at the
                                                          2
desired α value. If the calculated H-value is greater than χα	 value, we reject the null
                                                            2
hypothesis (H 0).
     Example 6.14:
     Consider an example (Table 6.35) where three research tools were evaluated by 17 dif-
     ferent researchers and were given a performance score out of 100. Investigate whether
     there is a significant difference in the performance rating of the tools.
     Solution:
        Step 1: Formation of hypothesis.
           In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
           eses for the example are given below:
               H0: µ1  =  µ2  =  µ3 (The performance rating of all tools does not differ
                    significantly.)
               Ha: µ1 ≠ µ2 ≠ µ3 (The performance rating of all tools differ significantly.)
        Step 2: Select the appropriate statistical test.
           The three samples are independent in nature as they are rated by 17 different
           researchers. The outcome variable is continuous. As we need to compare more
           than two samples, we use Kruskal–Wallis test to investigate whether there is a
           significant difference in the performance rating of the tools.
                                   TABLE 6.35
                                   Performance Score of Tools
                                                 Tools
                                   Tool 1        Tool 2        Tool 3
                                   30             65              55
                                   75             25              75
                                   65             35              65
                                   90             20              85
                                   100            45              95
                                   95                             75
256                                                               Empirical Research in Software Engineering
                          TABLE 6.36
                          Computation of Rank Kruskal–Wallis Test
                          for Performance Score of Research Tools
                                                                          Sample
                          Observations                          Rank       Name
                          20                                      1        Tool 2
                          25                                      2        Tool 2
                          30                                      3        Tool 1
                          35                                      4        Tool 2
                          45                                      5        Tool 2
                          55                                      6        Tool 3
                          65                                      8        Tool 1
                          65                                      8        Tool 2
                          65                                      8        Tool 3
                          75                                      11       Tool 1
                          75                                      11       Tool 3
                          75                                      11       Tool 3
                          85                                      13       Tool 3
                          90                                      14       Tool 1
                          95                                      15.5     Tool 1
                          95                                      15.5     Tool 3
                          100                                     17       Tool 1
                                  12       ( 68.5 )2 ( 20 )2 ( 64.5 )2 
                         =                          +       +           − 3 ( 17 + 1) = 7
                             17 ( 17 + 1)  6            5        6 
      	                                                                
   1. Organize and sort the data values of all the treatments for a specific data instance or
      data set in descending (high to low) order. Allocate ranks to all the observations from
      1 to k, where rank 1 is assigned to the best performing treatment value and rank k to
      the worst performing treatment. In case of two or more observations of equal values,
      assign the average of the ranks that would have been assigned to the observations.
   2. We then compute the total of ranks allocated to a specific treatment on all the data
      instances. This is done for all the treatments and the rank total for k treatments is
      denoted by R1, R 2, … Rk.
   3. Finally, the χ2-statistic is computed by the following formula:
                                                      k
                                                     ∑R
                                          12
                                χ2 =                            2
                                                                    − 3n ( k + 1)
                                       nk ( k + 1)
                                                            i
                                                     i =1
     where:
       Ri is the individual rank total of the ith treatment
       n is the number of data instances
The value of Friedman measure χ2 is distributed over k − 1 DOF. If the value of Friedman
measure is in the critical region (obtained from chi-squared table with specific level of
significance, i.e., 0.01 or 0.05 and k − 1 DOF), then the null hypothesis is rejected and it is
concluded that there is difference among performance of different treatments, otherwise
the null hypothesis is accepted.
     Example 6.15:
     Consider Table 6.37, where the performance values of six different classification methods
     are stated when they are evaluated on six data sets. Investigate whether the performance
     of different methods differ significantly.
258                                                                    Empirical Research in Software Engineering
                 TABLE 6.37
                 Performance Values of Different Methods
                                                                      Methods
                 Data Sets             M1                M2         M3         M4       M5           M6
                 D1                83.07                75.38       73.84   72.30       56.92        52.30
                 D2                66.66                75.72       73.73   71.71       70.20        45.45
                 D3                83.00                54.00       54.00   77.00       46.00        59.00
                 D4                61.93                62.53       62.53   64.04       56.79        53.47
                 D5                74.56                74.56       73.98   73.41       68.78        43.35
                 D6                72.16                68.86       63.20   58.49       60.37        48.11
      Solution:
         Step 1: Formation of hypothesis.
            In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
            eses for the example are given below:
                H0: There is no statistical difference between the performances of various
                     methods.
                Ha: There is statistical significant difference between the performances of
                     various methods.
         Step 2: Select the appropriate statistical test.
            As we need to evaluate the difference between the performances of different
            methods when they are evaluated using six data sets, we are evaluating
            different treatments on different data instances. Moreover, there is no specific
            assumption for data normality. Thus, we can use Friedman test.
         Step 3: Apply test and calculate p-value.
            We compute the rank total allocated to each method on the basis of perfor-
            mance ranking of each method on different data sets as shown in Table 6.38.
            Now, compute the Friedman statistic,
                                       ∑R
                             12
                  χ2 =                           2
                                                     − 3n ( k +1)
                          nk ( k +1)
                                12
                      =
                          6 × 6 × ( 6 + 1)
                                            (                                       )
                                           13.52 + 13.52 + 18 2 + 192 + 292 + 33 2 − 3.6 ( 6 + 1) = 16.11
  	 	 	
                                                         DOF = k − 1 = 5
                      TABLE 6.38
                      Computation of Rank Totals for Friedman Test
                                                                     Methods
                      Data Sets                 M1          M2       M3     M4      M5          M6
                      D1                        1          2         3      4       5           6
                      D2                        5          1         2      3       4           6
                      D3                        1          4.5       4.5    2       6           3
                      D4                        4          2.5       2.5    1       5           6
                      D5                        1.5        1.5       3      4       5           6
                      D6                        1          2         3      5       4           6
                      Rank total                13.5       13.5      18     19      29          33
Data Analysis and Statistical Testing                                                            259
           We look up the tabulated value of χ2-distribution with 5  DOF, and find the
           tabulated value as 15.086 at α = 0.01. The p-value is computed as 0.007.
        Step 4: Define significance level.
           The calculated value of χ2 (χ2 = 16.11) is greater than the tabulated value. As the
           computed p-value in Step 3 is <0.01, the results are significant at α = 0.01.
        Step 5: Derive conclusions.
            Since the calculated value of χ² is greater than the tabulated value, we reject the
            null hypothesis. Thus, we conclude that the performances of the six methods differ
            significantly (χ² = 16.11, p = 0.007).
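
For illustration, the Friedman test in this example can be reproduced with a statistical library. The following is a minimal sketch assuming SciPy is available; the accuracy values are those listed for methods M1–M6 on data sets D1–D6 in the data table above.

```python
# A minimal sketch of Example 6.15 using SciPy; one list per method,
# ordered D1..D6, taken from the accuracy table above.
from scipy.stats import friedmanchisquare

m1 = [83.07, 66.66, 83.00, 61.93, 74.56, 72.16]
m2 = [75.38, 75.72, 54.00, 62.53, 74.56, 68.86]
m3 = [73.84, 73.73, 54.00, 62.53, 73.98, 63.20]
m4 = [72.30, 71.71, 77.00, 64.04, 73.41, 58.49]
m5 = [56.92, 70.20, 46.00, 56.79, 68.78, 60.37]
m6 = [52.30, 45.45, 59.00, 53.47, 43.35, 48.11]

stat, p = friedmanchisquare(m1, m2, m3, m4, m5, m6)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p:.4f}")
# The result should be close to the hand-computed values (chi-square ~ 16.11,
# p ~ 0.007); SciPy applies a tie correction, so small differences are possible.
```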
6.4.14 Nemenyi Test
The Nemenyi test is a post hoc test that is applied after the Friedman test, when the null hypothesis is rejected, to compare all pairs of subjects. The critical difference (CD) is computed as follows:

                                             CD = qα √[k(k + 1)/(6n)]
Here k corresponds to the number of subjects and n corresponds to the number of obser-
vations for a subject. The critical values (qα) are the studentized range statistic divided by √2.
The computed CD value is compared with the difference between the average ranks allocated
to two subjects. If the difference is equal to or greater than the CD value, the two
subjects differ significantly at the chosen significance level α.
     Example 6.16:
     Consider an example where we compare four techniques by analyzing the performance
     of the models predicted using these four techniques on six data sets each. We first apply
     the Friedman test to obtain the average ranks of all the techniques. The computed average
     ranks are shown in Table 6.39. The result of the Friedman test indicated the rejection
     of the null hypothesis. Evaluate whether there are significant differences among the
     different techniques using pairwise comparisons.
     Solution:
        Step 1: Formation of hypothesis.
           In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
           eses for the example are given below:
               H01: The performance of T1 and T2 techniques do not differ significantly.
               Ha1: The performance of T1 and T2 techniques differ significantly.
               H02: The performance of T1 and T3 techniques do not differ significantly.
               Ha2: The performance of T1 and T3 techniques differ significantly.
                H03: The performance of T1 and T4 techniques do not differ significantly.
                Ha3: The performance of T1 and T4 techniques differ significantly.
                H04: The performance of T2 and T3 techniques do not differ significantly.
                Ha4: The performance of T2 and T3 techniques differ significantly.
                H05: The performance of T2 and T4 techniques do not differ significantly.
                Ha5: The performance of T2 and T4 techniques differ significantly.
                H06: The performance of T3 and T4 techniques do not differ significantly.
                Ha6: The performance of T3 and T4 techniques differ significantly.
                             TABLE 6.39
                             Average Ranks of Techniques after
                             Applying Friedman Test
                                                T1       T2        T3     T4
                             Average rank      3.67     2.67       1.92   1.75
      Step 2: Select the appropriate statistical test.
         The Friedman test indicated a significant difference among the techniques, so a
         post hoc test is required to identify which pairs of techniques differ. As all
         pairwise comparisons are needed, we use the Nemenyi test.
      Step 3: Apply test and calculate CD.
         In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is
         2.569. The CD can be calculated by the following formula:

                CD = qα √[k(k + 1)/(6n)] = 2.569 × √[4 × (4 + 1)/(6 × 6)] = 1.91
         We now find the differences among ranks of each pair of techniques as shown
         in Table 6.40.
      Step 4: Define significance level.
         Table 6.41 shows the comparison results of critical difference and actual rank
         differences among different techniques. The rank difference of only T1–T4 pair
         is higher than the computed critical difference. The rank differences of all other
                                 TABLE 6.40
                                 Computation of Pairwise Rank
                                 Differences among Techniques
                                 for Nemenyi Test
                                 Pair                         Difference
                                 T1–T2                   3.67 − 2.67 = 1.00
                                 T1–T3                   3.67 − 1.92 = 1.75
                                 T1–T4                   3.67 − 1.75 = 1.92
                                 T2–T3                   2.67 − 1.92 = 0.75
                                 T2–T4                   2.67 − 1.75 = 0.92
                                 T3–T4                   1.92 − 1.75 = 0.17
                                   TABLE 6.41
                                   Comparison of Differences
                                   for Nemenyi Test
                                   Pair                       Difference
                                   T1–T2                      1.00 < 1.91
                                   T1–T3                      1.75 < 1.91
                                   T1–T4                      1.92 > 1.91
                                   T2–T3                      0.75 < 1.91
                                   T2–T4                      0.92 < 1.91
                                   T3–T4                      0.17 < 1.91
            technique pairs are not significant at α = 0.05.
        Step 5: Derive conclusions.
            As the rank difference of only the T1–T4 pair is higher than the computed critical
            difference, we conclude that the T4 technique significantly outperforms the T1
            technique at significance level α = 0.05. The difference in performance of all other
            technique pairs is not significant. We therefore reject the null hypothesis H03 and
            accept all the other null hypotheses (H01, H02, H04–H06).
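
The pairwise Nemenyi comparison of Example 6.16 can also be scripted. The following is a minimal sketch in Python; the average ranks and the critical value qα = 2.569 (k = 4, α = 0.05) are taken from the example, while the bookkeeping code itself is illustrative.

```python
# A minimal sketch of the Nemenyi post hoc comparison in Example 6.16.
from itertools import combinations
from math import sqrt

avg_rank = {"T1": 3.67, "T2": 2.67, "T3": 1.92, "T4": 1.75}
k, n, q_alpha = 4, 6, 2.569

cd = q_alpha * sqrt(k * (k + 1) / (6 * n))      # critical difference, ~1.91
print(f"CD = {cd:.2f}")

for a, b in combinations(avg_rank, 2):
    diff = abs(avg_rank[a] - avg_rank[b])
    verdict = "significant" if diff > cd else "not significant"
    print(f"{a}-{b}: |rank difference| = {diff:.2f} -> {verdict}")
```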
6.4.15 Bonferroni–Dunn Test
The Bonferroni–Dunn test is a post hoc test that is similar to the Nemenyi test. It can be used
to compare multiple subjects, even if the sample sizes are unequal. It is generally used
when all subjects are compared with a control subject (Demšar 2006). For example, all
techniques may be compared with a specific control technique A to evaluate the pairwise
performance of every technique against technique A. The Bonferroni–Dunn test is also
called the Bonferroni correction and is used to control the family-wise error rate. A family-wise
error may occur when we test a number of hypotheses, referred to as a family of
hypotheses, on a single set of data or samples. The probability that at least one hypothesis
is found significant merely by chance (Type I error) needs to be controlled in such a case
(Garcia et al. 2007). The Bonferroni–Dunn test is mostly used after a Friedman test, if the
null hypothesis is rejected. To control the family-wise error, the critical value α is divided
by the number of comparisons. For example, if we are comparing k − 1 subjects with a
control subject, then the number of comparisons is k − 1. The formula for the new critical
value is as follows:
                                  αNew = α / Number of comparisons
There is another method for performing the Bonferroni–Dunn test by computing the CD
(in the same way as for the Nemenyi test). However, the α values used are adjusted to control
the family-wise error. We compute the CD value as follows:
                                             CD = qα √[k(k + 1)/(6n)]
Here k corresponds to the number of subjects and n corresponds to the number of obser-
vations for a subject. The critical values (qα) are studentized range statistic divided by √2.
Note that the number of comparisons in the Appendix table includes the control subject.
We compare the computed CD with the difference between average ranks. If the difference is
less than the CD, we conclude that the two subjects do not differ significantly at the chosen
significance level α.
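
As a small illustration of the adjustment described above, the following sketch computes the corrected significance level and the CD for the setting used in Example 6.17 below (k = 4, n = 6, qα = 2.394 at α = 0.05); the numbers are taken from that example.

```python
# A minimal sketch of the Bonferroni-Dunn adjustment: k - 1 techniques are
# compared against one control technique.
from math import sqrt

alpha, k, n = 0.05, 4, 6
alpha_new = alpha / (k - 1)                 # corrected per-comparison level
cd = 2.394 * sqrt(k * (k + 1) / (6 * n))    # critical difference, ~1.79
print(f"adjusted alpha = {alpha_new:.4f}, CD = {cd:.2f}")
```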
     Example 6.17:
     Consider an example where we compare four techniques by analyzing the performance
     of the models predicted using these four techniques on six data sets each. We first apply
     Friedman test to obtain the average ranks of all the methods. The computed average
     ranks are shown in Table 6.42. The result of the Friedman test indicated the rejection of
     the null hypothesis. Evaluate whether there are significant differences between T1 and
     all the other techniques.
                        TABLE 6.42
                        Average Ranks of Techniques
                                                    T1             T2          T3       T4
                        Average rank            3.67               2.67       1.92      1.75
      Solution:
         Step 1: Formation of hypothesis.
            In this step, null (H0) and alternative (Ha) hypotheses are formed. The hypoth-
            eses for the example are given below:
                H01: The performance of T1 and T2 techniques do not differ significantly.
                Ha1: The performance of T1 and T2 techniques differ significantly.
                H02: The performance of T1 and T3 techniques do not differ significantly.
                Ha2: The performance of T1 and T3 techniques differ significantly.
                H03: The performance of T1 and T4 techniques do not differ significantly.
                Ha3: The performance of T1 and T4 techniques differ significantly.
         Step 2: Select the appropriate statistical test.
            The example needs to evaluate the comparison of T1 technique with all other
            techniques. Thus, T1 is the control technique. The evaluation of different tech-
            niques is performed using Friedman test, and the result led to rejection of
            the null hypothesis. To analyze whether there are any significant differences
            among the performance of the control technique and other techniques, we need
            to apply a post hoc test. Thus, we use the Bonferroni–Dunn test.
         Step 3: Apply test and calculate CD.
            In this example, k = 4 and n = 6. The value of qα for four subjects at α = 0.05 is
            2.394. The CD can be calculated by the following formula:
                CD = qα √[k(k + 1)/(6n)] = 2.394 × √[4 × (4 + 1)/(6 × 6)] = 1.79
            We now find the differences among ranks of each pair of techniques, as shown
            in Table 6.43.
         Step 4: Define significance level.
            Table 6.44 shows the comparison results of critical difference and actual rank
            differences among different techniques. The rank difference of only T1–T4 pair
            is higher than the computed critical difference. However, the rank difference
            of T1–T3 is quite close to the critical difference. The difference in performance
            of T1–T2 is not significant.
         Step 5: Derive conclusions.
            As the rank difference of only the T1–T4 pair is higher than the computed critical
            difference, we conclude that the T4 technique significantly outperforms the
            T1 technique at significance level α = 0.05. We therefore reject the null hypothesis
                                  TABLE 6.43
                                  Computation of Pairwise Rank
                                  Differences among Techniques for
                                  Bonferroni–Dunn Test
                                  Pair                                Difference
                                  T1–T2                            3.67 − 2.67 = 1.00
                                  T1–T3                            3.67 − 1.92 = 1.75
                                  T1–T4                            3.67 − 1.75 = 1.92
                                   TABLE 6.44
                                   Comparison of Differences
                                   for Bonferroni–Dunn Test
                                   Pair                Difference
                                   T1–T2               1.00 < 1.79
                                   T1–T3               1.75 < 1.79
                                   T1–T4               1.92 > 1.79
            H03 and accept the null hypotheses H01 and H02.
                                      prob(X1) = e^(A0 + A1X1) / [1 + e^(A0 + A1X1)]
where:
 X1 is an independent variable
 A1 is the weight
 A0 is a constant
The sign of the weight indicates the direction of the effect of the independent variable on the
dependent variable. A positive sign indicates that the independent variable has a positive effect on
the dependent variable, and a negative sign indicates that the independent variable has a negative
effect on the dependent variable. The significance statistic (Sig.) is employed to test the hypothesis.
  In linear regression, the t-test is used to find the significant independent variables; in
LR, the Wald test is used for the same purpose.
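
For illustration, the kind of univariate LR analysis reported in Tables 6.45 through 6.48 can be sketched as follows. This assumes the statsmodels library and a hypothetical pandas DataFrame named data, containing one column per metric and a binary column fault; it is not the data set used to produce the tables.

```python
# A minimal sketch of univariate logistic regression per metric, assuming a
# hypothetical DataFrame `data` with metric columns and a binary `fault` column.
import numpy as np
import pandas as pd
import statsmodels.api as sm

metrics = ["CBO", "WMC", "RFC", "SLOC", "LCOM", "NOC", "DIT"]

def univariate_lr(data: pd.DataFrame, metric: str):
    X = sm.add_constant(data[[metric]])          # adds the intercept A0
    result = sm.Logit(data["fault"], X).fit(disp=False)
    b = result.params[metric]                    # coefficient B
    se = result.bse[metric]                      # standard error SE
    sig = result.pvalues[metric]                 # Wald test significance
    exp_b = np.exp(b)                            # odds ratio Exp(B)
    # result.prsquared is McFadden's pseudo R-squared, which may differ from
    # the R2 statistic reported in the tables.
    return b, se, sig, exp_b

# for m in metrics:
#     print(m, univariate_lr(data, m))
```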
      TABLE 6.45
      Univariate Analysis Using LR Method for HSF
      Metric        B         SE          Sig.    Exp(B)       R2
      TABLE 6.46
      Univariate Analysis Using LR Method for MSF
      Metric        B         SE        Sig.      Exp(B)      R2
      CBO          0.276     0.030     0.0001     1.318      0.375
      WMC          0.065     0.011     0.0001     1.067      0.215
      RFC          0.025     0.004     0.0001     1.026      0.196
      SLOC         0.010     0.001     0.0001     1.110      0.392
      LCOM         0.009     0.003     0.0050     1.009      0.116
      NOC         −1.589     0.393     0.0001     0.204      0.090
      DIT          0.058     0.092     0.5280     1.060      0.001
      TABLE 6.47
      Univariate Analysis Using LR Method for LSF
      Metric       B         SE          Sig.      Exp(B)      R2
      CBO         0.175     0.025      0.0001       1.191     0.290
      WMC         0.050     0.011      0.0001       1.052     0.205
      RFC         0.015     0.004      0.0001       1.015     0.140
      SLOC        0.004     0.001      0.0001       1.004     0.338
      LCOM        0.004     0.003      0.2720       1.004     0.001
      NOC        −0.235     0.192      0.2200       0.790     0.002
      DIT         0.148     0.099      0.1340       1.160     0.005
        TABLE 6.48
        Univariate Analysis Using LR Method for USF
        Metric          B     SE        Sig.     Exp(B)      R2
        CBO        0.274     0.029     0.0001    1.315      0.336
        WMC        0.068     0.019     0.0001    1.065      0.186
        RFC        0.023     0.004     0.0001    1.024      0.127
        SLOC       0.011     0.002     0.0001    1.011      0.389
        LCOM       0.008     0.003     0.0100    1.008      0.013
        NOC        −0.674    0.185     0.0001    0.510      0.104
        DIT        0.086     0.091     0.3450    1.089      0.001
Table 6.45 summarizes the results of univariate analysis for predicting fault proneness with
respect to high-severity faults (HSF). From Table 6.45, we can see that five out of seven metrics
were found to be very significant (Sig. < 0.01). However, the NOC and DIT metrics were not found
to be significant. The LCOM metric is significant at the 0.05 significance level. The value of the
R² statistic is highest for the SLOC and CBO metrics.
   Table 6.46 summarizes the results of univariate analysis for predicting fault proneness
with respect to medium-severity faults (MSF). Table 6.46 shows that the value of the R² statistic
is the highest for the SLOC metric. All the metrics except DIT are found to be significant. NOC
has a negative coefficient, which implies that classes with higher NOC values are less fault
prone.
   Table 6.47 summarizes the results of univariate analysis for predicting fault proneness
with respect to low-severity faults (LSF). Again, it can be seen from Table 6.47 that the
value of the R² statistic is highest for the SLOC metric. The results show that four out of seven
metrics are found to be very significant. The LCOM, NOC, and DIT metrics are not found to
be significant.
   Table 6.48 summarizes the results of univariate analysis for predicting fault proneness. The
results show that six out of seven metrics were found to be very significant when the faults
were not categorized according to their severity, that is, ungraded severity faults (USF). The
DIT metric is not found to be significant and the NOC metric has a negative coefficient. This
shows that the NOC metric is related to fault proneness but in an inverse manner.
   Thus, the SLOC metric has the highest R² value at all severity levels of faults, which shows
that it is the best predictor. The CBO metric has the second highest R² value. The values of the
R² statistic are more important than the value of Sig., as they show the strength
of the correlation.
Exercises
  6.1 Describe the measures of central tendency. Discuss the concepts with
     examples.
  6.2 Consider the following data set on faults found by inspection technique for a
     given project. Calculate mean, median, and mode.
     100, 160, 166, 197, 216, 219, 225, 260, 275, 290, 315, 319, 361, 354, 365, 410, 416, 440, 450,
     478, 523
  6.3 Describe the measures of dispersion. Explain the concepts with examples.
  6.4 What is the purpose of collecting descriptive statistics? Explain the importance
     of outlier analysis.
  6.5 What is the difference between attribute selection and attribute extraction
     techniques?
  6.6 What are the advantages of attribute reduction in research?
  6.7 What is CFS technique? State its application with advantages.
  6.8 Consider the data set given in exercise 6.2. Calculate the standard deviation,
     variance, and quartiles.
  6.9 Consider the following table presenting three variables. Determine the normality
     of these variables.
                                                     Cyclomatic       Branch
                                 Fault Count         Complexity       Count
                                 332                    25             612
                                 274                    24             567
                                 212                    23             342
                                 106                    12             245
                                 102                    10             105
                                 93                     09              94
                                 63                     05              89
                                 23                     04              56
                                 09                     03              45
                                 04                     01              32
  6.10 What is outlier analysis? Discuss its importance in data analysis. Explain
     univariate, bivariate, and multivariate outlier analysis.
  6.11 Consider the table given in exercise 6.9. Construct box plots and identify univari-
     ate outliers for all the variables given in the data set.
  6.12 Consider the data set given in exercise 6.9. Identify bivariate outliers between the
     dependent variable fault count and the other variables.
  6.13 Consider the following data with the performance accuracy values for different
     techniques on a number of data sets. Check whether the conditions of ANOVA are
     met. Also apply ANOVA test to check whether there is significant difference in the
     performance of techniques.
                                                        Techniques
                     Data Sets         Technique 1       Technique 2     Technique 3
                     D1                    84                71                 59
                     D2                    76                73                 66
                     D3                    82                75                 63
                     D4                    75                76                 70
                     D5                    72                68                 74
                     D6                    85                82                 67
                                                         Data Sets #
                       Algorithms               1                 2             3
                       Algorithm 1               9             7                9
                       Algorithm 2              19            20               20
                       Algorithm 3              18            15               14
                       Algorithm 4              13             7                6
                       Algorithm 5              10             9                8
  6.15 A software company plans to adopt a new programming paradigm that will
     ease the task of software developers. To assess its effectiveness, 50 software devel-
     opers used the traditional programming paradigm and 50 others used the new
     one. The productivity values per hour are stated as follows. Perform a t-test to
     assess the effectiveness of the new programming paradigm.
                                                      Old                 New
                                                 Programming          Programming
                        Statistic                  Paradigm             Paradigm
                                    P1             1,739          1,690
                                    P2             2,090          2,090
                                    P3               979            992
                                    P4               997            960
                                    P5             2,750          2,650
                                    P6               799            799
                                    P7               980          1,000
                                    P8             1,099          1,050
                                    P9             1,225          1,198
                                    P10              900            943
  6.17 The software team needs to determine average number of methods in a class
     for a particular software product. Twenty-two  classes were chosen at random
     and the number of methods in these classes were analyzed. Evaluate whether the
     hypothesized mean of the chosen sample is different from 11 methods per class for
     the whole population.
Class No. No. of Methods Class No. No. of Methods Class No. No. of Methods
  6.18 A software organization develops software tools using five categories of pro-
     gramming languages. Evaluate a goodness-of-fit test on the data given below to
      test whether the organization develops equal proportion of software tools using
      the five different categories of programming languages.
                                            Programming          Number of
                                            Language             Software
                                            Category               Tools
                                            Category 1                35
                                            Category 2                30
                                            Category 3                45
                                            Category 4                44
                                            Category 5                28
  6.19 Twenty-five students developed the same program and the cyclomatic
     complexity values of these 25  programs are stated. Evaluate whether the
     cyclomatic complexity values of the program developed by the 25 students fol-
     lows normal distribution.
6, 11, 9, 14, 16, 10, 13, 9, 15, 12, 10, 14, 15, 10, 8, 11, 7, 12, 13, 17, 17, 19, 9, 20, 26
                                                                         Methodology
                                                                   OO          Procedural          Total
                 Software                Requirements               80             100             180
                  Development            Initial design             50             110             160
                  Stage                  Detailed design            75              65             140
                 Total                                             205             275             480
  6.21 The coupling values of a number of classes are provided below for two different
     samples. Test the hypothesis using F-test whether the two samples belong to the
     same population.
                  Sample 1        32      42       33      40       42      44       42      38       32
                  Sample 2        31      31       31      35       35      32       30      36
                            1                          25               45
                            2                          15               55
                            3                          25               65
                            4                          15               65
                            5                           5               35
                            6                          35               15
                            7                          45               45
                            8                           5               75
                            9                          55               85
                                                      Algorithms
                                        Data Sets      A1       A2
                                        D1             0.65     0.55
                                        D2             0.78     0.85
                                        D3             0.55     0.70
                                        D4             0.60     0.60
                                        D5             0.89     0.70
  6.24 Two attribute selection techniques were analyzed to check whether they have
     any effect on model’s performance. Seven models were developed using attribute
     selection technique X and nine models were developed using attribute selection
     technique Y. Use Wilcoxon–Mann–Whitney test to evaluate whether there is any
     significant difference in the model’s performance using the two different attribute
     selection techniques.
                                 Technique X                      Technique Y
                                 57.5                             58.9
                                 58.6                             58.0
                                 59.3                             61.5
                                 56.9                             61.2
                                 58.4                             62.3
                                 58.8                             58.9
                                 57.7                             60.0
                                                                  60.9
                                                                  60.4
  6.25 A researcher wants to find the effect of the same learning algorithm on
     three data sets. For every data set, a model is predicted using the same learn-
     ing algorithm, with area under the ROC curve (AUC) as the performance measure.
                                      Data Set              AUC
                                      1                     0.76
                                      2                     0.85
                                      3                     0.66
  6.26 A market survey is conducted to evaluate the effectiveness of three text editors
     by 20  probable customers. The customers assessed the text editors on various
     criteria and provided a score out of 300. Test the hypothesis whether there is any
     significant differences among the three text editors using Kruskal–Wallis test.
                                                     Methods
                                Data Sets   A1       A2        A3       A4
                                D1          0.65     0.56      0.72     0.55
                                D2          0.79     0.69      0.69     0.59
                                D3          0.65     0.65      0.62     0.60
                                D4          0.85     0.79      0.66     0.76
                                D5          0.71     0.61      0.61     0.78
                                                  Tools
                                 Data Sets   T1   T2      T3   T4
                                 Model 1     69   60      83   73
                                 Model 2     70   68      81   69
                                 Model 3     73   54      75   67
                                 Model 4     71   61      91   79
                                 Model 5     77   59      85   69
                                 Model 6     73   56      89   77
Further Readings
The following books provide details on summarizing data:
There are several books on research methodology and statistics in which various concepts
and statistical tests are explained:
V. Barnett, and T. Price, Outliers in Statistical Data, John Wiley & Sons, New York, 1995.
A detailed description of various wrapper and filter methods can be found in:
Some of the useful facts and concepts of significance tests are presented in:
  P. M. Bentler, and D. G. Bonett, “Significance tests and goodness of fit in the analysis
     of covariance structures,” Psychological Bulletin, vol. 88, no. 3, pp. 588–606, 1980.
  J. M. Bland, and D. G. Altman, “Multiple significance tests: The Bonferroni method,”
     BMJ, vol. 310, no. 6973, pp. 170, 1995.
  L. L. Harlow, S. A. Mulaik, and J. H. Steiger, What If There Were No Significance Tests,
     Psychology Press, New York, 2013.
   1. Independent variables
   2. Dependent variables
   3. Learning technique
[Figure 7.1 depicts the following flow: the data set (instances with attributes Attr1, Attr2, … and a dependent variable) is processed with attribute reduction techniques to obtain a reduced set of independent variables; the data is split into training and testing parts; learning techniques and model validation methods are applied to obtain a predicted model; the predicted output variable is compared with the actual output variable using performance measures; and hypothesis testing supports the decision on the accuracy of the model.]
FIGURE 7.1
Steps in model prediction.
FIGURE 7.2
Data partition.
techniques. Figure  7.2  shows that the original data can be divided into training and
testing data samples. The model is developed using training data and validated on the
unseen test data. In cross-validation, the data is split into two independent parts, one
for training and the other for testing. In empirical studies where the available data set is
very large, the data may also be divided into training, validation, and testing samples.
The validation data can be used to choose the correct architecture as an optional
step. However, in this book, we describe the concepts in terms of training and testing
samples.
  During model development, the data must be randomly divided into training and test-
ing data samples.
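
A minimal sketch of such a random partition, assuming scikit-learn, is given below; the arrays are synthetic placeholders rather than a real software data set.

```python
# A minimal sketch of randomly partitioning data into training and testing samples.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))          # 50 instances, 5 independent variables
y = rng.integers(0, 2, size=50)       # binary dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# The model is developed on (X_train, y_train) and validated on the unseen
# (X_test, y_test); stratify=y keeps the class ratio similar in both samples.
print(X_train.shape, X_test.shape)
```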
[Figure 7.3 lists the performance measures by type of dependent variable. Categorical: sensitivity or recall, specificity, accuracy, precision, G-measure, and area under the curve. Continuous: mean relative error, mean absolute relative error, and correlation coefficient.]
FIGURE 7.3
Performance measures for dependent variable.
Figure 7.3 shows the performance measures that may be used depending on the type of
dependent variable. The type of dependent variable can be either categorical or continu-
ous and the nature of dependent variable is determined by the distribution of the outcome
variable (ratio of positive and negative samples). The guidelines on the selection of perfor-
mance measures on the basis of nature of dependent variables are given in Section 7.5.7. The
cross-validation method is applied for model validation. Depending on the size of data, the
appropriate cross-validation method is selected.
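
For illustration, a k-fold cross-validation run can be sketched as follows, again assuming scikit-learn and synthetic placeholder data; the choice of classifier and of k = 10 is arbitrary.

```python
# A minimal sketch of k-fold cross-validation on synthetic placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))               # synthetic independent variables
y = np.tile([0, 1], 30)                    # synthetic binary dependent variable

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.2f} +/- {scores.std():.2f}")
```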
[Figure 7.4 depicts product metrics (object-oriented metrics) with bug/change data and process metrics with bug/change data as inputs to software quality assessment, whose observables include sensitivity, specificity, precision, area under the curve, accuracy, F-measure, and G-measure.]
FIGURE 7.4
Software quality assessment framework.
the various observables such as maintainability, fault proneness, and reliability that can be
used to provide information about the software quality.
  A software quality prediction system takes as an input the various product-oriented
metrics that define various characteristics of a software. The focus of this study is object-
oriented (OO) software. Thus, OO metrics are used, throughout the software process, to
determine and quantify various aspects of an OO application. A single metric alone is
not sufficient to uncover all characteristics of an application under development. Several
metrics must be used together to gauge a software product. The metrics used in this
study are the ones that are most commonly used by various researchers to account for
software characteristics like size, coupling, cohesion, inheritance, and so on. Along with
metrics, the collection of fault/change-prone data of a class is also an essential input
to create an efficient and intelligent classifier prediction system. The prediction system
learns to distinguish and identify fault/change-prone classes of the software data set
under study.
  Development of a software quality prediction system helps in ascertaining software
quality attributes and in the focused use of constrained resources. It also guides researchers and
practitioners to perform preventive actions in the early phases of software development
and to commit themselves to the creation of better-quality software. Once a software quality
prediction system is trained, it can be used for quality assessment and for predictions
on future unseen data. These predictions are then utilized for assessing software quality
processes and procedures as we evaluate the software products, which are a result of these
processes.
                                   y = a + b1x1 + b2x2 + ⋯ + bnxn
where:
 a is constant
 b1…bn are weights
 x1…xn are independent variables
The weights are generated in such a way that the predicted values are closest to the actual
values. Closeness of the predicted values to the actual values can be measured using ordinary
least squares, where the sum of squared differences between the predicted and actual values is
kept to a minimum. The difference between the actual and predicted values is
known as the prediction error. Thus, the linear regression model that best fits the data for
predicting the dependent variable is the one for which the sum of squared errors is minimum.
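
A minimal sketch of fitting such a model by ordinary least squares, assuming NumPy and synthetic placeholder data, is given below.

```python
# A minimal sketch of fitting y = a + b1*x1 + ... + bn*xn by ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                      # independent variables x1..x3
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=30)

A = np.column_stack([np.ones(len(X)), X])         # prepend the constant term a
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)    # minimizes the sum of squared errors
print("a, b1, b2, b3 =", np.round(coeffs, 2))
```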
  LR is used to predict the dependent variable from a set of independent variables to deter-
mine the percentage of variance in the dependent variable explained by the independent vari-
able (a detailed description is given by Basili et al. [1996], Hosmer and Lemeshow [1989], and
Aggarwal et al. [2009]). The multivariate LR formula can be defined as (Aggarwal et al. 2009):
          prob(X1, X2, …, Xn) = e^(A0 + A1X1 + ⋯ + AnXn) / [1 + e^(A0 + A1X1 + ⋯ + AnXn)]
where:
 Xi , i = 1, 2, , n are the independent variables
 “Prob” is the probability of occurrence of an event
The impact of an independent variable on the dependent variable is interpreted using its
coefficient: the higher the value of the coefficient, the greater the impact of the independent variable.
  In multivariate analysis two stepwise selection methods—forward selection and
backward elimination—are used (Hosmer and Lemeshow 1989). The forward stepwise
procedure examines the variables that are selected one at a time for entry at each step.
The backward elimination method includes all the independent variables in the model.
Variables are deleted one at a time from the model until a stopping criterion is fulfilled.
  The statistical significance defines the significance level of a coefficient. The larger the value
of the statistical significance, the smaller is the estimated impact of an independent variable on the
dependent variable. Usually, a value of 0.01 or 0.05 is used as the threshold cutoff value.
  •   DT
  •   Bayesian learners (BL)
  •   Ensemble learners (EL)
  •   NN
  •   SVM
  •   Rule-based learning (RBL)
  •   Search-based techniques (SBT)
  •   Miscellaneous
[Figure 7.5 groups the ML techniques into families: DT (ADT, J48, C4.5, CART, CHAID); BL (WNB, ABN, NB, BN, TNB); EL (Bagging, LB, AB, RF, LMT); SBT (hybrid EDER-SD and SA-PNN, AntMiner, MOPSO, ACO, GP); NN (MLP, RBF, PNN, CNN); SVM; RBL (OneR, NNge, Ripper, DTNB); and miscellaneous (IB1, RP, VF1, VP, Kstar, KNN, IBK).]
FIGURE 7.5
Classification of ML techniques. CHAID: chi-squared automatic interaction detection, CART: classification and
regression trees, ADT: alternating decision tree, RF: random forest, NB: naïve Bayes, BN: Bayesian networks, ABN: aug-
mented Bayesian networks, WNB: weighted Bayesian networks, TNB: transfer Bayesian networks, MLP: multilayer
perceptron, PNN: probabilistic neural network, RBF: radial basis function, LB: Logit boost, AB: AdaBoost, NNge:
neighbor with generalization, GP: genetic programming, ACO: ant colony optimization, SVM: support vector
machines, RP: recursive partitioning, AIRS: artificial immune system, KNN: K-nearest neighbor, VFI: Voting Feature
Intervals, EDER-SD: evolutionary decision rules for subgroup discovery, SA-PNN: simulated annealing probabilis-
tic neural network, VP: voted perceptron, DTNB: decision tree naive bayes, LMT: logistic model trees. (Data from
Malhotra, 2015.)
FIGURE 7.6
DT algorithm.
obtain the final outcome by taking a vote. Rather than using a single ML technique, this
approach aims to improve the accuracy of the model by combining the results obtained by
multiple ML techniques. Combining multiple ML techniques has often been shown to give more
accurate results than using an individual ML technique. Figure 7.7 depicts the concept of EL.
There are various techniques based on EL such as boosting, bagging, and RF. Table 7.1
summarizes the widely used EL techniques.
[Figure 7.7 depicts several models being built from the training data, whose outputs are combined to produce the final output.]
FIGURE 7.7
Ensemble learning.
TABLE 7.1
Ensemble Learning Techniques
Technique                                                    Description
RF             RF was proposed by Breiman (2001) and constructs a forest of multiple trees and each tree
                depends on the value of a random vector. For each of the tree in the forest, this random vector
                is sampled with the same distribution and independently. Hence, RF is a classifier that
                consists of a number of decision trees.
Boosting       Boosting uses DT algorithm for creating new models. Boosting assigns weights to models
                based on their performance. Many variants of boosting algorithms are available in the
                literature; two widely used variants are AdaBoost (Freund and Schapire 1996) and
                LogitBoost (Friedman et al. 2000).
Bagging        Bagging or bootstrap aggregating improves the performance of classification models by
                creating various sets of the training sets.
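
For illustration, the three EL techniques of Table 7.1 can be tried with scikit-learn as sketched below; the data are synthetic placeholders and the parameter values are library defaults, not recommendations.

```python
# A minimal sketch of the ensemble techniques in Table 7.1 on synthetic data.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # synthetic binary outcome

for name, clf in [("RF", RandomForestClassifier(n_estimators=100, random_state=2)),
                  ("Bagging", BaggingClassifier(random_state=2)),
                  ("AdaBoost", AdaBoostClassifier(random_state=2))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.2f}")
```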
training sets. During this process, the architecture is determined, such as the number of
hidden layers and the number of nodes in the hidden layer (Figure 7.8). Usually, one hidden
layer is used in research, as whatever can be achieved in function approximation with more
than one hidden layer can also be achieved with one hidden layer (Khoshgaftaar 1997).
  The weights between the jth hidden node and input nodes are represented by Wji, while
the weights between the jth hidden node and output node are represented by αj. The thresh-
old of the jth hidden node is represented by βj, while the threshold of the output layer is
represented by β. If x represents the input vector to the network, the net input to the hid-
den node j is given by (Haykin 1994):
                          netj = ∑(i=1 to M) Wji xi + βj,     j = 1, 2, …, N
[Figure 7.8 depicts input signals entering the input layer, passing through the hidden layer, and producing output signals at the output layer.]
FIGURE 7.8
Architecture of NN.
                                           σj = σ(netj)

The output from the network is given by:

                                           y = σ(∑(j=1 to N) αj σj + β)
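
A minimal sketch of this forward pass, assuming NumPy, a sigmoid activation, and randomly chosen placeholder weights, is given below.

```python
# A minimal sketch of the forward pass defined by the two equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M, N = 4, 3                                   # input nodes, hidden nodes
rng = np.random.default_rng(3)
W = rng.normal(size=(N, M))                   # W[j, i]: weight from input i to hidden node j
beta_hidden = rng.normal(size=N)              # hidden-node thresholds beta_j
alpha = rng.normal(size=N)                    # hidden-to-output weights alpha_j
beta_out = rng.normal()                       # output-layer threshold beta

x = rng.normal(size=M)                        # input vector
net = W @ x + beta_hidden                     # net_j = sum_i W_ji * x_i + beta_j
sigma = sigmoid(net)                          # sigma_j = sigma(net_j)
y = sigmoid(alpha @ sigma + beta_out)         # network output
print(y)
```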
FIGURE 7.9
Radial basis function.
  Given a set of training samples (x1, y1), …, (xm, ym) with yi ∈ {−1, +1}, αi (i = 1, …, m) is a
Lagrangian multiplier, K(xi, x) is called a kernel function, and b is a bias. The discriminant
function D of a two-class SVM is given below (Zhao and Takagi 2007):
                                  D(x) = ∑(i=1 to m) yi αi K(xi, x) + b

                                  x = +1 if D(x) > 0, and x = −1 if D(x) < 0
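
For illustration, the discriminant function and the sign-based classification rule can be exercised with scikit-learn as sketched below; the RBF kernel and the synthetic data are assumptions, not part of the original description.

```python
# A minimal sketch of the two-class SVM discriminant D(x) on synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

clf = SVC(kernel="rbf").fit(X, y)
x_new = np.array([[0.5, -0.2]])
d = clf.decision_function(x_new)              # signed value of D(x)
label = 1 if d[0] > 0 else -1                 # classify by the sign of D(x)
print(d, label, clf.predict(x_new))
```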
FIGURE 7.10
Basic algorithm for rule-based learning.
Harman and Jones (2001) advocated the application of the SBT for predictive model-
ing work, as SBT will allow software engineers to balance constraints and conflicts in
search space because of noisy, partially inaccurate, and incomplete data sets. A system-
atic review of studies was performed on software quality prediction which reported that
there are few studies that assess the predictive performance of SBT for defect prediction
and change prediction (Malhotra 2014a; Malhotra and Khanna 2015). Thus, future stud-
ies should employ SBT to evaluate their capability in the area of defect and change model
prediction. An important factor while developing a prediction model is its runtime
[Figure 7.11 depicts the following cycle: an initial population is assessed using a fitness function; selection, reproduction, crossover, and mutation produce an updated population; the cycle repeats until the stopping criterion is met, after which the search terminates.]
FIGURE 7.11
Process of search-based techniques.
(speed). The systematic review also revealed that the SBT require higher running time
for model development as compared to the ML techniques, but parallel or cloud search-
based software engineering (SBSE) can lead to promising results and significant time
reduction (Di Geronimo 2012; White 2013).
  A summary of characteristics of ML techniques is given in Table 7.2.
     TABLE 7.2
     Characteristics of ML Techniques
     Technique Name                                    Characteristics
  • Data must be preprocessed using outlier analysis, normality tests and so on. It
    may help in increasing the accuracy of the models. Section 6.1 presents the pre-
    processing techniques.
  • The model must be checked for multicollinearity effects (see Section 7.4.2).
  • Dealing with imbalanced data (see Section 7.4.3).
  • A suitable learning technique must be selected for model development (see
    Section 7.4.4).
  • The training and test data must be as independent as possible, as new data is
    expected to be applied for model validation.
  • The parameter setting of ML techniques may be adjusted (not over adjusted) and
    should be carefully documented so that repeatable studies can be conducted (see
    Section 7.4.5).
                 TABLE 7.3
                 Recommended Solution to Learning Problems
                 Issue                                    Remedy
model predicted. Principal component method (or P.C. method) is a standard technique
used to find the interdependence among a set of variables. The factors summarize the
commonality of these variables, and factor loadings represent the correlation between the
variables and the factor. P.C. method maximizes the sum of squared loadings of each factor
extracted in turn (Aggarwal et al. 2009). The P.C. method is applied to these variables to
find the maximum eigenvalue, emax, and the minimum eigenvalue, emin. The conditional
number is defined as λ = emax/emin. If the value of the conditional number exceeds 30, then
multicollinearity is not tolerable (Belsley et al. 1980).
   Variance inflation factor (VIF) is used to estimate the degree of multicollinearity in pre-
dicted models. R2s are calculated using ordinary least square regression method and VIF
is defined below:
                                          VIF = 1 / (1 − R²)
According to the literature, a VIF value of less than 10 is tolerable.
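
A minimal sketch of computing VIF for each independent variable, assuming scikit-learn and synthetic placeholder data, is given below; each variable is regressed on the remaining ones by ordinary least squares to obtain its R².

```python
# A minimal sketch of VIF = 1 / (1 - R^2) for each independent variable.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
X[:, 3] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)   # induce multicollinearity

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)                   # all remaining variables
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vif = 1.0 / (1.0 - r2)
    flag = "tolerable" if vif < 10 else "high multicollinearity"
    print(f"variable {j}: VIF = {vif:.2f} ({flag})")
```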
[Figure 7.12 lists the desirable properties of a learning algorithm: accurate, simple, understandable, interpretable, fast, and scalable.]
FIGURE 7.12
Properties of learning algorithms.
The predictions made by the model are with respect to the classes of the outcome variable
(also referred to as the dependent variable) of the problem under consideration.
For example, if the outcome variable of the problem has two classes, then that problem is
referred to as a binary problem. Similarly, if the outcome variable has three classes, then
that problem is known as a three-class problem, and so on.
  Consider the confusion matrix given in Table 7.4 for a two-class problem, where the out-
come variable consists of positive and negative values.
  The following measures are used in the confusion matrix:
  •   True positive (TP): Refers to the number of positive instances correctly predicted as positive
  •   False negative (FN): Refers to the number of positive instances incorrectly predicted as negative
  •   False positive (FP): Refers to the number of negative instances incorrectly predicted as positive
  •   True negative (TN): Refers to the number of negative instances correctly predicted as negative
Now, consider a three-class problem where an outcome variable consists of three classes,
C1, C2, and C3, as shown in Table 7.5.
  From the above confusion matrix, we will get the values of TP, FN, FP, and TN corre-
sponding to each of the three classes, C1, C2, and C3, as shown in Tables 7.6 through 7.8.
  Table 7.6 depicts the confusion matrix corresponding to class C1. This table is derived
from Table 7.5, which shows the confusion matrix for all the three classes C1, C2, and C3. In
Table 7.6, the number of TP instances is “a,” where “a” represents the class C1 instances that are
correctly classified as belonging to class C1. The “b” and “c” are the class C1 instances that
are incorrectly labeled as belonging to class C2 and class C3, respectively. Therefore, these
instances come under the category of FN. On the other hand, d and g are the instances
belonging to class C2  and class C3, respectively, and they have been incorrectly marked
as belonging to class C1 by the prediction model. Hence, they are FP instances. The e, f, h,
and i are all the remaining samples that are correctly classified as nonclass C1  instances.
                             TABLE 7.4
                             Confusion Matrix for Two-Class
                             Outcome Variables
                                                       Predicted
                                                  Positive    Negative
                             Actual    Positive     TP            FN
                                       Negative     FP            TN
                              TABLE 7.5
                              Confusion Matrix for Three-Class
                              Outcome Variables
                                                       Predicted
                                                  C1         C2    C3
                              Actual       C1      a         b     c
                                           C2      d         e     f
                                           C3      g         h     i
                          TABLE 7.6
                          Confusion Matrix for Class “C1”
                                                       Predicted
                                                  C1               Not C1
                          Actual     C1           a                b + c
                                     Not C1       d + g            e + f + h + i
                         TABLE 7.7
                         Confusion Matrix for Class “C2”
                                                           Predicted
                                                 C2                    Not C2
                         Actual     C2           e                     d + f
                                    Not C2       b + h                 a + c + g + i
                        TABLE 7.8
                        Confusion Matrix for Class “C3”
                                                           Predicted
                                                 C3                    Not C3
                        Actual     C3            i                     g + h
                                   Not C3        c + f                 a + b + d + e
Therefore, they are referred to as TN instances. Similarly, Tables 7.7 and 7.8 depict the con-
fusion matrix for classes C2 and C3.
Sensitivity (also called recall) is defined as the ratio of correctly classified positive instances
to the total number of actual positive instances. It is given by the following formula:

\[ \text{Sensitivity or recall (Rec)} = \frac{TP}{TP + FN} \times 100 \]
However, an important point to note is that this value says nothing about the other
instances that do not belong to class C but are still incorrectly classified as belonging
to class C.
  Specificity is defined as the ratio of correctly classified negative instances to the total
number of actual negative instances. It is given by the following formula:
\[ \text{Specificity} = \frac{TN}{FP + TN} \times 100 \]
Ideally, the values of both sensitivity and specificity should be as high as possible. A low
value of sensitivity indicates that there are many high-risk classes (positive classes) that
are incorrectly classified as low-risk classes, whereas a low value of specificity indicates
that there are many low-risk classes (negative classes) that are incorrectly classified as
high-risk classes (Aggarwal et al. 2009). For example, consider a two-class problem in
a software organization where a module may be faulty or not faulty. In this case, low
sensitivity would result in the delivery of software with faulty modules to the customer,
and low specificity would result in the wastage of the organization's resources in test-
ing the software.
Accuracy is defined as the ratio of correctly predicted instances (both positive and negative)
to the total number of instances:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \times 100 \]
Precision measures how many of the positive predictions are correct. It is defined as the
ratio of correctly predicted positive instances to the total number of predicted positive
instances. In a classification task, a precision of 100% for a class C means that all the
instances labeled as belonging to class C do indeed belong to class C. However, the value
says nothing about the instances of class C that are not predicted as belonging to class C.
\[ \text{Precision (Pre)} = \frac{TP}{TP + FP} \times 100 \]
F-measure is the harmonic mean of precision and recall:

\[ \text{F-measure} = \frac{2 \times \text{Pre} \times \text{Rec}}{\text{Pre} + \text{Rec}} \]
  G-measure represents the harmonic mean of recall and (100 − false positive rate [FPR])
and is defined as given below:

\[ \text{G-measure} = \frac{2 \times \text{Rec} \times (100 - \text{FPR})}{\text{Rec} + (100 - \text{FPR})} \]
where:
  FPR is the ratio of negative instances that are incorrectly predicted as positive to the
      total number of actual negative instances, and is given below:
\[ \text{FPR} = \frac{FP}{FP + TN} \times 100 \]
G-mean is popularly used for imbalanced data sets, where the effect of negative cases
prevails. It is the combination of two evaluations, namely, the accuracy of positives (a+)
and the accuracy of negatives (a−) (Shatnawi 2010). Therefore, it keeps a balance between
both these accuracies and is high only if both the accuracies are high. It is defined as follows:
\[ a^{+} = \frac{TP}{TP + FP}; \quad a^{-} = \frac{TN}{TN + FN}; \quad \text{G-mean} = \sqrt{a^{+} \times a^{-}} \]
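As a quick illustration of the measures defined above, the short Python sketch below computes them from the four cells of a two-class confusion matrix; the TP, FN, FP, and TN counts used here are arbitrary and chosen only for demonstration.

import math

# Illustrative confusion-matrix cells (arbitrary counts, not from the text).
TP, FN, FP, TN = 50, 10, 40, 900

sensitivity = TP / (TP + FN) * 100            # recall (Rec)
specificity = TN / (FP + TN) * 100
accuracy    = (TP + TN) / (TP + FN + FP + TN) * 100
precision   = TP / (TP + FP) * 100            # Pre
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
fpr         = FP / (FP + TN) * 100
g_measure   = 2 * sensitivity * (100 - fpr) / (sensitivity + (100 - fpr))
a_pos       = TP / (TP + FP)                  # accuracy of positives (a+)
a_neg       = TN / (TN + FN)                  # accuracy of negatives (a-)
g_mean      = math.sqrt(a_pos * a_neg)

print(f"Sensitivity={sensitivity:.2f}%  Specificity={specificity:.2f}%  "
      f"Accuracy={accuracy:.2f}%  Precision={precision:.2f}%")
print(f"F-measure={f_measure:.2f}  G-measure={g_measure:.2f}  G-mean={g_mean:.3f}")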
      Example 7.1:
      Let us consider an example system consisting of 1,276 instances. The independent vari-
      ables of this data set are the OO metrics belonging to the popularly used Chidamber
      and Kemerer (C&K) metric suite. The dependent variable has two values, namely, faulty
      or not faulty. In other words, this data set depicts whether a particular module of soft-
      ware contains a fault or not. If a module contains a fault, then the value of the outcome
      variable corresponding to that module is 1; otherwise, it is 0. This data set is then used
      to build the prediction model. The observed and the predicted values of the outcome
      variable are used to construct the confusion matrix to evaluate the performance of the
      model. The confusion matrix obtained from the results is given in Table 7.9. Compute
      the values of the performance measures sensitivity, specificity, accuracy, precision,
      F-measure, G-measure, and G-mean based on Table 7.9.
      Solution:
      The values of different measures to evaluate the performance of the prediction model
      when the dependent variable is of categorical type are shown below in Table 7.10.
                        TABLE 7.9
                        Confusion Matrix for Binary Categorical Variable
                                                             Predicted
                                                     Faulty (1)   Not Faulty (0)
TABLE 7.10
Performance Measures for Confusion Matrix Given in Table 7.9
Performance Measures                         Formula                            Values Obtained        Results
                       TABLE 7.11
                       Confusion Matrix for Three-Class Outcome Variable
                                                                       Predicted
                                                      High (1)         Medium (2)         Low (3)
       Solution:
        From the confusion matrix given in Table 7.11, the values of TP, FN, FP, and TN are
        derived corresponding to each of the three classes high (1), medium (2), and low (3),
        and are shown in Tables 7.12 through 7.14.
The values of the different performance measures at each severity level, namely, high, medium,
and low, computed on the basis of Tables 7.12 through 7.14, are given in Table 7.15.
                                    TABLE 7.12
                                    Confusion Matrix for Class “High”
                                                                  Predicted
                                                              High       Not High
                        TABLE 7.13
                        Confusion Matrix for Class “Medium”
                                                             Predicted
                                                    Medium       Not Medium
                                 TABLE 7.14
                                 Confusion Matrix for Class “Low”
                                                        Predicted
                                                     Low      Not Low
TABLE 7.15
Performance Measures at Each Severity Level of Fault

Category              Sensitivity/   Specificity   Accuracy   Precision   F-Measure    a+      a−     G-Mean    FPR     G-Measure
(Severity of Fault)   Recall (Rec)
High                      25.0          91.66        78.33      0.428       0.316      0.428   0.830   0.595     8.3      49.86
Medium                    89.47         40.90        71.66      0.723       0.799      0.723   0.692   0.707    59.09     56.14
Low                       50.0          98.0         90.0       0.833       0.625      0.833   0.907   0.868     2.00     66.21

Formulas used: Rec = TP/(TP + FN) × 100; Specificity = TN/(FP + TN) × 100; Accuracy = (TP + TN)/(TP + FN + FP + TN) × 100;
Pre = TP/(TP + FP); F-measure = (2 × Pre × Rec)/(Pre + Rec); a+ = TP/(TP + FP); a− = TN/(TN + FN); G-mean = √(a+ × a−);
FPR = FP/(FP + TN) × 100; G-measure = [2 × Rec × (100 − FPR)]/[Rec + (100 − FPR)].
FIGURE 7.13
Example of an ROC curve (sensitivity plotted against 1-specificity).
An increase in the value of sensitivity will generally lead to a decrease in the value of
specificity. The ROC curve starts from the origin and moves toward the upper-right portion
of the graph, as can be seen from Figure 7.13. The ends of the curve meet the end points
of the diagonal line. The closer the curve is to the left-hand border and the top border of
the ROC graph, the more accurate is the prediction capability of the model. In contrast,
the closer the curve comes to the 45-degree diagonal of the ROC graph, the less accurate
is the model prediction.
                            TABLE 7.16
                            Co-Ordinates of the ROC Curve
                            Cutoff Point   Sensitivity   1-Specificity
     Example 7.2:
     Consider an example to compute AUC using ROC analysis. In this example, OO metrics
     are taken as independent variables and fault proneness is taken as the dependent vari-
     able. The model is predicted by applying an ML technique. Table 7.17 depicts actual and
     predicted dependent variable. Use ROC analysis for the following:
         1. Identify AUC.
         2. Based on the AUC, determine the predicted capability of the model.
     Solution:
      The value of the dependent variable is 0 if the module does not contain any fault,
      and its value is 1 if the module contains a fault. On the basis of this input, the ROC
      curve obtained using the Statistical Package for the Social Sciences (SPSS) tool is shown
      in Figure 7.14. The value of the AUC is 0.642, and the coordinates of the curve are
      depicted in Table 7.18, which shows the values of sensitivity and 1-specificity along
      with their corresponding cutoff points. Because the AUC is only 0.642, the model
      performance is not good (for the interpretation of performance measures, refer to
      Section 7.9.1).
                                    TABLE 7.17
                                    Example for ROC Analysis
                                               Predicted                       Predicted
                                    Actual    Probability        Actual       Probability
                                    1               0.055            0           0.061
                                    1               0.124            0           0.254
                                    1               0.124            0           0.191
                                    1               0.964            0           0.024
                                    1               0.124            0           0.003
                                    1               0.016            0           0.123
                                    0               0.052            1           0.123
                                    0               0.015            1           0.024
                                    0               0.125            1           0.169
                                    0               0.123            1           0.169
                                    1               0.052            1           0.169
FIGURE 7.14
The obtained ROC curve (sensitivity plotted against 1-specificity).
                         TABLE 7.18
                         Co-Ordinates of the ROC Curve
                         Cutoff Point    Sensitivity   1-Specificity
                         0                  1               1
                         0.009              1               0.9
                         0.015              1               0.8
                         0.020              0.917           0.8
                         0.038              0.833           0.7
                         0.056              0.750           0.6
                         0.092              0.750           0.5
                         0.123              0.667           0.3
                         0.124              0.417           0.3
                         0.147              0.417           0.2
                         0.180              0.083           0.2
                         0.222              0.083           0.1
                         0.609              0.083           0
                         1                  0               0
The accuracy measure is highly sensitive to the distributions in the data. In other words,
accuracy is very sensitive to the imbalances in a given data set. Any data set that exhibits
an unequal distribution of positive and negative instances is considered an imbalanced
data set (Malhotra 2015). Therefore, as the class distribution varies, the value of the measure
will also change even though the performance of the learning technique remains the same.
As a result, the accuracy measure will not be a true representative of the model performance.
For example, if a data set contains mostly negative samples and all the samples are predicted
as negative, the accuracy will be very high but the predicted model is useless. Hence, this
measure is not recommended when there is a need to compare the performance of two
learning techniques over different data sets.
   Therefore, other measures popularly used in learning are precision, recall, F-measure,
and G-measure. We will first discuss precision and recall and examine their behavior with
respect to imbalanced data. As we know, precision is a measure of exactness that deter-
mines the number of instances that are labeled correctly out of the total number of
instances labeled as positive. In contrast, recall is a measure of completeness that deter-
mines the number of positive class instances that are labeled correctly. By these defini-
tions, it is clear that precision and recall have an inverse relationship with each other, and
that precision is sensitive to data distributions whereas recall is not. However, recall gives
no information about the number of instances that are incorrectly labeled as positive.
Similarly, precision tells nothing about the number of positive instances that are labeled
incorrectly. Therefore, precision and recall are often combined to form a measure referred
to as F-measure. F-measure is considered an effective measure of classification that, unlike
the accuracy measure, provides insight into the functionality of a classifier. However,
F-measure is also sensitive to data distributions. Another popularly used evaluation
measure, the G-measure, evaluates the degree of inductive bias in terms of a ratio of
positive accuracy and negative accuracy. Although F-measure and G-measure are much
better than the accuracy measure, they are still not suitable for comparing the performance
of different classifiers over a range of sample distributions.
   To overcome the above issues, the AUC generated by ROC analysis is widely used as
the performance measure, specifically for imbalanced data. AUC computed from ROC
analysis has been widely used in medical diagnosis for many years, and its use is increasing
in the field of data mining research. Carvalho et al. (2010) advocated AUC as the relevant
criterion for dealing with unbalanced and noisy data, as AUC is insensitive to changes in
the class distribution. He and Garcia (2009) recommended the use of AUC for dealing with
the issues of imbalanced data with regard to class distributions, as it provides a visual
representation of the relative trade-offs between the benefits (represented by TP) and costs
(represented by FP) of classification. In addition to ROC curves, for data sets that are highly
skewed, a researcher may use precision-recall (PR) curves. The PR curve is expressed as a
plot of the precision rate against the recall rate (He and Garcia 2009). ROC curves achieve
maximum model accuracy in the upper left-hand corner of the ROC space, whereas a PR
curve achieves maximum model accuracy in the upper right-hand corner of the PR space.
Hence, PR space can be used as an effective mechanism for assessing a predicted model's
accuracy when the data are highly skewed.
   Another shortcoming of ROC curves is that they cannot deduce the statistical
significance of different model performance over varying class probabilities or misclas-
sification costs. To address these problems, another solution suggested by He and Garcia
(2009) is to use cost curves. A cost curve is an evaluation method that, like the ROC curve,
visually depicts the model's performance over varying misclassification costs and class
distributions (He and Garcia 2009).
   In general, given the limitations of each performance measure, the researcher may use
multiple measures to increase the conclusion validity of the empirical study.
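For readers who work in Python, the ROC, AUC, and PR measures discussed above are readily available in scikit-learn. The sketch below, which uses invented labels and predicted probabilities rather than any data from this chapter, shows how these quantities are typically obtained.

from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

# Invented actual classes and predicted probabilities of the positive class.
y_true  = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_score = [0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.8, 0.1, 0.5, 0.35]

auc = roc_auc_score(y_true, y_score)                    # area under the ROC curve
fpr, tpr, cutoffs = roc_curve(y_true, y_score)          # 1-specificity, sensitivity
prec, rec, _ = precision_recall_curve(y_true, y_score)  # points of the PR curve

print(f"AUC = {auc:.3f}")
for cutoff, sens, one_minus_spec in zip(cutoffs, tpr, fpr):
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  1-specificity={one_minus_spec:.2f}")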
The mean relative error (MRE) is defined as:

\[ \text{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{P_i - A_i}{A_i} \]

where:
  N is the total number of instances in a given data set
  Pi is the predicted value of an instance i
  Ai is the actual value of an instance i
\[ \text{MARE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i - A_i|}{A_i} \]
where:
 N refers to the total number of instances in a given data set
 Pi refers to the predicted value of an instance i
 Ai refers to the actual value of an instance i
\[ \text{PRED}(A) = \frac{d}{N} \]
where:
  N refers to the total number of instances in a given data set
  d is the number of instances having value of error less than or equal to “A” error
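The measures defined above translate directly into a few lines of code. The following Python sketch implements MRE, MARE, and PRED(A) as given here and, as a check, reproduces the values reported for the data of Table 7.19 in Example 7.3 (the threshold A is expressed as a fraction, e.g., 0.25 for PRED(25)).

# MRE, MARE, and PRED(A) as defined above.
def mre(actual, predicted):
    return sum((p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def mare(actual, predicted):
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def pred(actual, predicted, A):
    # fraction of instances whose absolute relative error is <= A
    d = sum(1 for a, p in zip(actual, predicted) if abs(p - a) / a <= A)
    return d / len(actual)

# Actual and predicted LOC values from Table 7.19.
actual    = [100, 76, 45, 278, 360, 240, 520, 390, 50, 110]
predicted = [ 90, 35, 60, 300,  90, 250, 500, 800, 45,  52]

print(round(mre(actual, predicted), 3))    # approximately -0.055
print(round(mare(actual, predicted), 3))   # approximately 0.356
print(pred(actual, predicted, 0.25))       # 0.5, that is, PRED(25) = 50%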
     Example 7.3:
      Consider an example to assess the performance of a model predicted with lines of code
      (LOC) as the outcome. Table 7.19 presents a data set consisting of ten instances that depict
      the LOC of a given software. The table shows the actual values of LOC and the values of
      LOC predicted once the model has been trained. Calculate all the performance measures
      for the data given in Table 7.19.
     Solution:
      The difference between the predicted and the actual values is shown in Table 7.20.
        Table 7.21 shows the values of the performance measures MRE, MARE, and PRED(A).
      MRE is the average of the values obtained after dividing the difference between the predicted
      and the actual values by the actual values. Similarly, MARE is the average of the values
      obtained after dividing the absolute difference between the predicted and the actual values by
                        TABLE 7.19
                        Actual and Predicted Values of Model Predicted
                        Module #           Actual (Ai)           Predicted (P i)
                        1                     100                      90
                        2                      76                      35
                        3                      45                      60
                        4                     278                     300
                        5                     360                      90
                        6                     240                     250
                        7                     520                     500
                        8                     390                     800
                        9                      50                      45
                        10                    110                      52
           TABLE 7.20
           Actual and Predicted Values of Model Predicted
           Module #       Actual (Ai)       Predicted (Pi)       Pi − Ai       (Pi − Ai)/Ai       |Pi − Ai|/Ai
           1                  100                 90               −10            −0.100              0.100
           2                   76                 35               −41            −0.539              0.539
           3                   45                 60                15             0.333              0.333
           4                  278                300                22             0.079              0.079
           5                  360                 90              −270            −0.750              0.750
           6                  240                250                10             0.042              0.042
           7                  520                500               −20            −0.038              0.038
           8                  390                800               410             1.051              1.051
           9                   50                 45                −5            −0.100              0.100
           10                 110                 52               −58            −0.527              0.527
                      TABLE 7.21
                      Performance Measures
                      Performance Measure          Formula                    Values Obtained        Result
                      MRE                          (1/N) Σ (Pi − Ai)/Ai       MRE = −0.55/10         −0.055
                      MARE                         (1/N) Σ |Pi − Ai|/Ai       MARE = 3.56/10          0.356
                      PRED(25)                     d/N                        PRED(25) = 5/10         50%
                      PRED(50)                     d/N                        PRED(50) = 6/10         60%
                      PRED(75)                     d/N                        PRED(75) = 9/10         90%
      the actual values. PRED(A) is obtained by dividing the number of instances that have an error
      value (MRE) less than or equal to "A" by the total number of instances in the data set. The
      PRED value is calculated at the 25%, 50%, and 75% levels, and the results are shown in Table 7.21.
      The results show that 50% of the instances have an error less than 25%, 60% of the instances
      have an error less than 50%, and 90% of the instances have an error less than 75%.
7.7 Cross-Validation
The accuracy obtained by using the same data set from which the model is built is quite
optimistic. Cross-validation is a model evaluation technique that divides the given data set
into training and testing data in a given ratio. The training data is used to train the model
using any of the learning techniques available in the literature. The trained model is then
used to make predictions for data it has not already seen, that is, the testing data. The
division of the data set into two parts is essential, as it provides information about how
well the learner performs on new data. The ratio by which the data set is divided is decided
on the basis of the cross-validation method used.
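As a rough illustration of the idea (using scikit-learn and a synthetic data set rather than any data from this chapter), the sketch below contrasts a simple hold-out split with tenfold cross-validation.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic two-class data standing in for a fault data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold-out validation: train on 70% of the data, test on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# Tenfold cross-validation: every instance is used for testing exactly once.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
auc_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print("Tenfold cross-validation mean AUC:", auc_scores.mean())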
FIGURE 7.15
Hold-out validation.
FIGURE 7.16
Tenfold cross-validation.
FIGURE 7.17
Leave-one-out validation (Trial 1 through Trial n).
FIGURE 7.18
Model comparison using statistical tests: models (Model-1 ... Model-n) built by applying ML techniques ML-1 ... ML-n to data sets Data-1 ... Data-n are compared using statistical tests.
                                      TABLE 7.22
                                      AUC of Models Predicted
                                      Data Set     Bagging      LR
   After applying the paired t-test using the procedure given in Section 6.4.6.3, the t-statistic
is 3.087 (p-value = 0.037) and the test is significant at the 0.05 significance level. Hence, the
null hypothesis is rejected and the alternative hypothesis is accepted. The example above
demonstrates how statistical tests can be used for model comparison. The empirical study in
Section 7.11 describes a practical example of the comparison of models using statistical tests.
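The same comparison can be carried out in Python with scipy's paired t-test. Because the AUC values of Table 7.22 are not reproduced here, the sketch below uses hypothetical per-data-set AUC values, so the resulting statistic will differ from the one reported above.

from scipy.stats import ttest_rel

# Hypothetical AUC values of two techniques over the same five data sets.
bagging_auc = [0.81, 0.68, 0.84, 0.74, 0.68]
lr_auc      = [0.72, 0.66, 0.85, 0.73, 0.56]

t_stat, p_value = ttest_rel(bagging_auc, lr_auc)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis of equal performance at the 0.05 level.")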
Figure 7.19 depicts the list of questions that must be addressed while interpreting the results.
FIGURE 7.19
Issues to be addressed while interpreting results: Do the results support the hypothesis? Who will be benefited by the results? What are the limitations of the study? How does the study relate to past studies?
      sensitivity or recall is computed, its value is 0%. Hence, in this case, recall is the
      most appropriate performance measure to represent the model accuracy.
    Case 2: Consider a scenario where the number of instances is 1,000, with 10 positive
      samples and the rest negative instances. If all the samples are predicted as positive,
      then precision is 1%, accuracy is 1%, specificity is 0%, and sensitivity is 100%.
      Hence, in this case, sensitivity is the most inappropriate performance measure
      to represent the model accuracy.
    Case 3: Another situation is when most of the instances are positive. Consider a scenario
      where the number of instances is 1,000, with 990 positive instances and the rest negative
      instances. If all the instances are predicted as positive, then precision is 99%, accuracy
      is 99%, specificity is 0%, and sensitivity is 100%. Hence, in this case, specificity
      is the most appropriate performance measure to represent the model accuracy.
    Case 4: Consider a scenario where the number of instances is 1,000, with 990 positive
      instances and the rest negative instances. If all the instances are predicted as negative,
      then precision is 0%, accuracy is 1%, specificity is 100%, and sensitivity is 0%.
      Hence, in this case, sensitivity, precision, and accuracy are the most appropriate
      performance measures to represent the model accuracy.
    Case 5: Consider 1,000 samples with 60 positive instances and 940 negative instances,
      where 50 instances are predicted correctly as positive, 40 are incorrectly predicted
      as positive, and 900 are correctly predicted as negative. For this example, sensitivity
      is 83.33%, specificity is 95.74%, precision is 55.56%, and accuracy is 95%. Hence,
      in this case, precision is the most appropriate performance measure to represent
      the model accuracy.
                            TABLE 7.23
                            AUC Values
                            AUC Range                Guideline
                            0.50–0.60             No discrimination
                            0.60–0.70             Poor
                            0.70–0.80             Acceptable/good
                            0.80–0.90             Very good
                            0.90 and higher       Excellent
Hence, the problem domain has a major influence on the values of performance measures,
and the models should be interpreted in the light of more than one performance measure.
AUC is another measure that provides a complete view of the accuracy of the model.
Guidelines for interpreting the accuracy of a prediction model based on the AUC are
given in Table 7.23.
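A small helper that maps an AUC value to the guideline labels of Table 7.23 can make reporting more consistent; a possible sketch is shown below.

# Map an AUC value to the interpretation guideline of Table 7.23.
def interpret_auc(auc):
    if auc < 0.50:
        return "Below the range covered in Table 7.23"
    if auc < 0.60:
        return "No discrimination"
    if auc < 0.70:
        return "Poor"
    if auc < 0.80:
        return "Acceptable/good"
    if auc < 0.90:
        return "Very good"
    return "Excellent"

print(interpret_auc(0.642))   # "Poor", consistent with the model of Example 7.2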
Language          C++                               C++       C++         C++         C++         C++      C++      Java        C++              C++       C++
 used                                                                                                                                                                 Java
Technique         LR, ML (DT, ANN)                  LR        LR          LR          LR          LR       LR       OLS         LR, ML           LR, ML
 used                                                                                                                            (DT, ANN)        (NNage,
                                                                                                                                                  RF, NB)             LR
Type of data      NASA data set                     Univ.     Comm.       Univ.       Comm.       Comm.             Comm.       Open source      NASA data
                                                                                                                                                  set                 Open source
Fault severity    Yes                               No        No          No          No          No                No          No               Yes
 taken?                                                                                                                                                               No
                  HSF      MSF      LSF     USF                                                   #1       #2                                    LSF/      HSF
                                                                                                                                                 USF
WMC               ++       ++       ++      ++      +         +           +           ++          +        0        ++          ++               ++        ++     ++ ++         ++
DIT               0        0        0       0       ++        0           ++          --          0        0        0           +                0         0      0  --         0
RFC               ++       ++       ++      ++      ++        +           ++          ++          ++       0        +           ++               ++        ++     ++ ++         ++
NOC               0        --       0       --      --        0           -           0                             ++          0                --               0        0    0
CBO               ++       ++       ++      ++      +         0           ++          ++          +        0        +           ++               ++        ++     ++ ++         0
LCOM              ++       ++       0       ++      0                                                                           +                +         ++     ++ ++         ++
LOC               ++       ++       ++      ++                            ++                      ++       ++                   ++               ++        ++
++ denotes the metric is significant at 0.01, + denotes the metric is significant at 0.05, -- denotes the metric is significant at 0.01 but in an inverse manner, - denotes the metric is significant at 0.05 but in an inverse manner, and 0 denotes that the metric is not significant. A blank entry means that our hypothesis is not examined or the metric is calculated in a different way. LR: logistic regression, OLS: ordinary least square, ML: machine learning, DT: decision tree, ANN: artificial neural network, RF: random forest, NB: naïve Bayes, LSF: low-severity fault, USF: ungraded-severity fault, HSF: high-severity fault, MSF: medium-severity fault, #1: without size control, #2: with size control, comm.: commercial, univ.: university.
               TABLE 7.25
               Summary of Hypothesis
               Metric                            Hypothesis Accepted/Rejected
ungraded severities of faults. The completeness value of the DIT metric is worse with respect to
all the severities of faults. Table 7.24 shows that most of the studies found that the DIT met-
ric is not related to fault proneness. The classes may have a small number of ancestors in most
of the studies, which is one possible reason for the lack of relationship between the DIT metric
and fault proneness; further investigation is needed. The null hypothesis for the DIT metric is
accepted and the alternative hypothesis is rejected.
  The weighted methods per class (WMC) hypothesis is found to be significant in the LR and
ANN analysis. On the other hand, Basili et al. (1996) found it to be less significant. In the
study conducted by Yu et al. (2002), the WMC metric was found to be a significant predictor
of fault proneness. Similar to the regression, DT, and ANN results in this study, Briand
et al. (2000), Gyimothy et al. (2005), Olague et al. (2007), and Zhou and Leung (2006) also
found the WMC metric to be one of the best predictors of fault proneness. The rest of the
studies found it to be a significant predictor, but at the 0.05 significance level.
  All three models found the WMC metric to be a significant predictor of fault prone-
ness. Hence, the null hypothesis is rejected and the alternative hypothesis is accepted.
  The LOC hypothesis is found to be significant in the LR analysis. It was also found signifi-
cant in all the studies that examined it. Hence, the null hypothesis is rejected and the
alternative hypothesis is accepted.
  Table 7.25 summarizes the results of the hypotheses stated in Section 4.7.6 with respect
to each severity of faults. The LCOM metric was found significant at the 0.05 level in the study
conducted by Zhou and Leung (2006) with regard to the low and ungraded severity levels
of faults. However, in this study, the LCOM metric is found significant at the 0.01 level with
respect to the HSF, MSF, and USF, and it is not found to be significant with respect to the LSF.
In this work, we predict the fault-prone classes using the OO design metrics suites given by
Bansiya and Davis (2002) and Chidamber and Kemerer (1994), instead of static code met-
rics. The results are evaluated using the AUC obtained from ROC analysis. Figure 7.20 pres-
ents the basic elements of the study (for details, refer to Malhotra and Raje 2014).
  This section presents the evaluation results of the various ML techniques for fault pre-
diction using selected OO metrics given in Table 7.26.
  The results are validated using six releases of the “MMS” application package of the
Android OS. The six releases of Android OS have been selected with three code names,
Independent variables: OO metrics (Bansiya and Davis metrics; Chidamber and Kemerer metrics). Dependent variable: defect proneness (probability of occurrence of a defect in a class). Learner: 18 ML techniques.
FIGURE 7.20
Elements of empirical study.
TABLE 7.26
Description of OO Metrics Used in the Study
Abb.                             Metric                                                Definition
WMC         Weighted methods per class                               Count of sum of complexities of the number of
                                                                      methods in a class.
NOC         Number of children                                       Number of subclasses of a given class.
DIT         Depth of inheritance tree                                Provides the maximum steps from the root to
                                                                      the leaf node.
LCOM        Lack of cohesion in methods                              Null pairs not having common attributes.
CBO         Coupling between objects                                 Number of classes to which a class is coupled.
RFC         Response for a class                                     Number of external and internal methods in a
                                                                      class.
DAM         Data access metric                                       Ratio of the number of private (and/or
                                                                      protected) attributes to the total number of
                                                                      attributes in a class.
MOA         Measure of aggregation                                   Percentage of data declarations (user defined)
                                                                      in a class.
MFA         Measure of functional abstraction                        Ratio of the total number of inherited methods to
                                                                      the number of methods in a class.
CAM         Cohesion among the methods of a class                    Computes method similarity based on their
                                                                      signatures.
AMC         Average method complexity                                Computed using McCabe’s cyclomatic
                                                                      complexity method.
LCOM3       Lack of cohesion in methods                              Revision of LCOM metric given by
                                                                      Henderson-Sellers
LOC         Line of code                                             Number of lines of source code of a given class.
                                                                                                          (Continued)
namely, Gingerbread, Ice Cream Sandwich, and Jelly Bean. The source code of these
releases has been obtained from Google's Git repository (https://android.googlesource.
com). The source code of Android OS is available in various application packages.
  The results of the ML techniques are compared by first applying the Friedman test, followed
by the post hoc Wilcoxon signed-rank test if the results of the Friedman test are significant. The
predicted models are validated using tenfold cross-validation. Further, the predictive capa-
bilities of the ML techniques are evaluated using across-release validation. To answer the
research questions given below, an empirical validation is done using various techniques on
the six releases of the Android OS using the following steps:
The models are generated using all the independent variables as well as using the variables
selected by the CFS technique. The results obtained using the reduced set of variables are
slightly better than the results obtained using all the independent variables. Table 7.27 presents
the relevant metrics found in each release of the Android data set after applying the CFS
technique. The results show that Ce, LOC, LCOM3, cohesion among methods (CAM), and
data access metric (DAM) are the most commonly selected OO metrics over the six releases
of the Android data sets.
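CFS itself is typically run in a tool such as WEKA. As a rough, simplified stand-in, the sketch below ranks features by their absolute correlation with the dependent variable and discards features that are highly correlated with an already selected one; it approximates the idea behind CFS but is not the CFS merit function, and the file and column names are hypothetical.

import pandas as pd

def simple_correlation_filter(df, target, redundancy_threshold=0.8):
    # Rank candidate features by absolute correlation with the target variable.
    relevance = df.drop(columns=[target]).corrwith(df[target]).abs()
    selected = []
    for feature in relevance.sort_values(ascending=False).index:
        # Keep a feature only if it is not highly correlated with one already kept.
        if all(abs(df[feature].corr(df[kept])) < redundancy_threshold for kept in selected):
            selected.append(feature)
    return selected

# Hypothetical usage: 'data' holds OO metrics plus a 0/1 'faulty' column.
# data = pd.read_csv("android_release.csv")
# print(simple_correlation_filter(data, target="faulty"))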
  After this, the ML techniques are empirically compared, and the results are evaluated
in terms of the AUC. The AUC has been advocated as a primary indicator of comparative
performance of the predicted models (Lessmann et al. 2008). The AUC measure can deal
with noisy and unbalanced data and is insensitive to the changes in the class distributions
(De Carvalho et al. 2008). Table 7.28 reports the tenfold cross-validation results of 18 ML
techniques on six releases of Android OS. The ML technique yielding the best AUC for a given
                     TABLE 7.27
                     Relevant OO Metrics
                     Release                       Relevant Features
      TABLE 7.28
      Tenfold Cross-Validation Results of 18 ML Techniques with Respect to AUC
                                       Android Data Set Release
      ML Tech.      2.3.7      4.0.2       4.0.4       4.1.2       4.2.2   4.3.1     Avg.
      LR             0.81       0.66        0.85        0.73       0.56    0.68      0.72
      NB             0.81       0.73        0.84        0.76       0.62    0.80      0.76
      BN             0.79       0.46        0.84        0.73       0.43    0.52      0.63
      MLP            0.79       0.71        0.85        0.71       0.61    0.76      0.74
      RBF            0.77       0.76        0.80        0.74       0.76    0.74      0.77
      SVM            0.64       0.50        0.76        0.50       0.50    0.50      0.57
      VP             0.66       0.59        0.67        0.56       0.50    0.50      0.58
      CART           0.77       0.45        0.75        0.74       0.43    0.45      0.60
      J48            0.71       0.48        0.78        0.67       0.43    0.52      0.60
      ADT            0.81       0.72        0.83        0.72       0.62    0.74      0.74
      Bag            0.81       0.68        0.84        0.74       0.68    0.72      0.75
      RF             0.79       0.65        0.82        0.67       0.70    0.73      0.73
      LMT            0.77       0.66        0.83        0.75       0.56    0.73      0.72
      LB             0.83       0.75        0.82        0.71       0.70    0.65      0.75
      AB             0.81       0.70        0.81        0.69       0.70    0.65      0.73
      NNge           0.69       0.53        0.75        0.66       0.65    0.51      0.64
      DTNB           0.76       0.46        0.81        0.71       0.43    0.68      0.65
      VFI            0.77       0.72        0.70        0.62       0.75    0.74      0.72
release is depicted in bold. The results show that the models predicted using the NB, AB,
RBF, Bag, ADT, MLP, LB, and RF techniques have an AUC greater than 0.7 for most of the
releases of the Android data set.
   To confirm that the performance difference among the ML models is not random, the
Friedman test is used to evaluate the superiority of one ML technique over the other
                  TABLE 7.29
                  Friedman Test Results
                  ML Tech.      Mean Rank       ML Tech.       Mean Rank
                  NB                3.58          RF               8.58
                  Bag               5.67          VFI              9.08
                  RBF               5.67          BN              10.83
                  ADT               5.92          DTNB            12.83
                  LB                7.17          NNge            13.58
                  MLP               7.25          CART            13.92
                  LR                7.08          J48             14.42
                  LMT               7.92          SVM             15.67
                  AB                8.17          VP              15.67
ML techniques. The Friedman test resulted in a significance value of zero. The results are
significant at the 0.05 level of significance with 17 degrees of freedom. Thus, the null
hypothesis that all the ML techniques have similar performance in terms of AUC is
rejected. The results given in Table 7.29 show that the NB technique is the best for predict-
ing the fault proneness of a class using OO metrics. The result supports the finding of
Menzies et al. (2007) that NB is the best technique for building fault prediction models.
It can also be seen that the models predicted using the SVM-based techniques, SVM and
VP, performed worst.
  RQ2: Which is the best ML technique for fault prediction using OO metrics?
  A2: The outcome of the Friedman test indicates that the performance of the NB tech-
    nique for fault prediction is the best. The Bagging and RBF techniques are tied for
    the second-best performance among the 18 ML techniques that were compared.
After obtaining significant results using the Friedman test, post hoc analysis was per-
formed using the Wilcoxon test. The Wilcoxon test is used to examine the statistical dif-
ference between pairs of ML techniques (see Section 6.4.10). The results of the pairwise
comparisons of the ML techniques are shown in Table 7.30.
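Both tests are also available in Python through scipy. The sketch below runs the Friedman test on the per-release AUC values of three of the techniques from Table 7.28 (NB, Bag, and SVM) and, if the result is significant, follows it with a pairwise Wilcoxon signed-rank test; it illustrates the procedure only and is not the full 18-technique analysis reported here.

from scipy.stats import friedmanchisquare, wilcoxon

# Per-release AUC values of three techniques, taken from Table 7.28.
nb  = [0.81, 0.73, 0.84, 0.76, 0.62, 0.80]
bag = [0.81, 0.68, 0.84, 0.74, 0.68, 0.72]
svm = [0.64, 0.50, 0.76, 0.50, 0.50, 0.50]

stat, p = friedmanchisquare(nb, bag, svm)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")

if p < 0.05:
    # Post hoc pairwise comparison, here NB against SVM.
    w_stat, w_p = wilcoxon(nb, svm)
    print(f"Wilcoxon NB vs. SVM: statistic = {w_stat:.3f}, p = {w_p:.3f}")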
  The results of the Wilcoxon test show that, out of the 18 ML techniques, the NB model is
significantly better than the models predicted using eight ML techniques: LMT, BN, DTNB,
NNge, CART, J48, SVM, and VP. Similarly, the VP model is significantly worse than the
models developed using the NB, Bag, RBF, ADT, LB, MLP, LR, LMT, AB, RF, and VFI
techniques; worse than the BN, DTNB, NNge, CART, and J48 techniques; and better than
the SVM model.
  Figure  7.21  shows the number of ML techniques from which the performance of a
given ML technique is either superior, significantly superior, inferior, or significantly
inferior. For example, from the bar chart shown in Figure  7.21, it can be seen that the
performance of the NB technique is significantly superior to eight other ML techniques
and nonsignificantly superior to nine  other techniques. Similarly, the performance of
the Bagging technique is significantly superior to seven  other ML techniques, nonsig-
nificantly superior to eight other techniques, and nonsignificantly inferior to two other
ML techniques.
TABLE 7.30
Wilcoxon Test Results
            NB    Bag   RBF    ADT     LB    MLP     LR   LMT     AB    RF   VFI    BN   DTNB     NNge     CRT     J48   SVM    VP
NB                 ↑      ↑      ↑      ↑      ↑      ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
Bag          ↓            ↓      ↑      ↑      ↑      ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
RBF          ↓     ↑             ↑      ↑      ↑      ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
ADT          ↓     ↓      ↓             ↓      ↑      ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
LB           ↓     ↓      ↓      ↑             ↑      ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
MLP          ↓     ↓      ↓      ↓      ↓             ↑     ↑      ↑     ↑     ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
LR           ↓     ↓      ↓      ↓      ↓      ↓            ↓      ↓     ↑     ↓     ↑      ↑        ↑       ↑      ↑     ↑      ↑
LMT          ↓     ↓      ↓      ↓      ↓      ↓      ↑            ↑     ↓     ↓     ↑      ↑        ↑       ↑      ↑     ↑      ↑
AB           ↓     ↓      ↓      ↓      ↓      ↓      ↑     ↓            ↑     ↑     ↓      ↑        ↑       ↑      ↑     ↑      ↑
RF           ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↑      ↓           ↑     ↑      ↑        ↑       ↑      ↑     ↑      ↑
VFI          ↓     ↓      ↓      ↓      ↓      ↓      ↑     ↑      ↓     ↓           ↑      ↑        ↑       ↑      ↑     ↑      ↑
BN           ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↑     ↓     ↓            ↑        ↑       ↑      ↑     ↑      ↑
DTNB         ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓               ↑       ↑      ↑     ↑      ↑
NNge         ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓      ↓                ↑      ↑     ↑      ↑
CART         ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓      ↓        ↓              =     ↑      ↑
J48          ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓      ↓        ↓       =            ↑      ↑
SVM          ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓      ↓        ↓       ↓      ↓            ↓
VP           ↓     ↓      ↓      ↓      ↓      ↓      ↓     ↓      ↓     ↓     ↓     ↓      ↓        ↓       ↓      ↓     ↑
↑ implies that the performance of the ML technique is significantly better than the compared ML technique, ↑ implies that it is
better (but not significantly), ↓ implies that it is significantly worse than the compared ML technique, ↓ implies that it is worse
(but not significantly), and = implies that the performance is equal.
FIGURE 7.21
Results of Wilcoxon test: for each ML technique, the number of compared techniques to which it is superior, significantly superior, inferior, or significantly inferior.
  The AUC values of the NB model are between 0.73 and 0.85 in five releases of the
Android data sets. The results in this study confirm the previous findings that the NB
technique is effective in fault prediction and may be used by researchers and practitioners
in future applications. The NB technique is based on the assumption that the attributes
are independent and unrelated. One of the reasons that the NB technique showed the best
performance is that the features are reduced using the CFS method before applying the
model prediction techniques in this work. The CFS method removes the features that are
correlated with each other and retains the features that are correlated with the dependent
variable. Hence, the OO metrics selected by the CFS method for each data set are less corre-
lated with each other and more correlated with the fault variable. The NB technique is easy
to understand and interpret (a linear model can be obtained as a sum of logs) and is also
computationally efficient (Friedman 1940; Zhou and Leung 2006). The NB technique does
not retain this level of performance in one release of the Android data set (Android 4.2.2).
This may be because the NB technique is not able to make accurate predictions of faults on
the basis of only one OO metric (DAM).
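For readers who wish to replicate a comparable NB setup outside WEKA, scikit-learn's GaussianNB combined with tenfold cross-validation and the AUC scorer gives a similar pipeline; the file name and the 'faulty' column in the sketch are placeholders, not artifacts of this study.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical input: one row per class with OO metrics and a 0/1 'faulty' label.
data = pd.read_csv("mms_release.csv")        # placeholder file name
X = data.drop(columns=["faulty"])
y = data["faulty"]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
auc_scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
print("Mean AUC over ten folds:", auc_scores.mean())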
   RQ3: Which pairs of ML techniques are significantly different from each other for
     fault prediction?
   A3: There are 112 pairs of ML techniques that yield significantly different perfor-
     mance results in terms of AUC. The results show that the performance of the NB
     model is significantly better than that of the LMT, BN, DTNB, NNge, CART, J48,
     SVM, and VP models. Similarly, the significant pairs among the other ML techniques
     are given in Table 7.30.
validation show that the AUC values of NB, RBF, ADT, Bagging, LMT, and AB are greater than
0.7 in most of the releases of Android.
  Figure 7.22 depicts the comparison of the overall results of the 18 ML techniques in terms of
the average AUC using both tenfold and across-release validation over all the Android
releases. The chart shows that the overall performance results obtained from the across-
release validation are better than or comparable to the results obtained from the tenfold
cross-validation, except when Android 4.0.4 is validated using Android 4.0.2. One possible
 TABLE 7.31
 Across-Release Validation Results of 18 ML Techniques with Respect to AUC
                                                                   Android
 ML Tech.      2.3.7 on 4.0.2                4.0.2 on 4.0.4        4.0.4 on 4.1.2        4.1.2 on 4.2.2      4.2.2 on 4.3.1   Avg.
 LR                    0.81                         0.80                0.80                 0.66                 0.58        0.73
 NB                    0.82                         0.80                0.79                 0.68                 0.70        0.76
 BN                    0.85                         0.50                0.79                 0.63                 0.50        0.66
 MLP                   0.84                         0.82                0.81                 0.66                 0.60        0.75
 RBF                   0.82                         0.76                0.78                 0.72                 0.80        0.78
 SVM                   0.68                         0.50                0.71                 0.50                 0.50        0.58
 VP                    0.72                         0.50                0.58                 0.50                 0.50        0.56
 CART                  0.80                         0.50                0.70                 0.63                 0.50        0.63
 J48                   0.84                         0.50                0.77                 0.63                 0.50        0.65
 ADT                   0.83                         0.69                0.81                 0.74                 0.74        0.77
 Bag                   0.85                         0.73                0.79                 0.72                 0.81        0.78
 RF                    0.84                         0.57                0.80                 0.71                 0.65        0.72
 LMT                   0.81                         0.77                0.80                 0.74                 0.58        0.74
 LB                    0.85                         0.69                0.81                 0.69                 0.80        0.77
 AB                    0.82                         0.78                0.78                 0.66                 0.80        0.77
 NNge                  0.78                         0.54                0.71                 0.65                 0.56        0.65
 DTNB                  0.80                         0.51                0.77                 0.63                 0.50        0.65
 VFI                   0.85                         0.79                0.73                 0.59                 0.78        0.75
[Bar chart: average AUC (0.0–0.9) of tenfold versus across-release validation for Android 4.0.2, 4.0.4, 4.1.2, 4.2.2, and 4.3.1.]
FIGURE 7.22
Comparison between AUC results of tenfold and across-release validation for five releases of the Android data set.
explanation for this is that the values of the OO metrics in the Android releases are informative
enough to predict faults in the subsequent releases. The reason for the low AUC values for
across-release validation as compared to the AUC values for tenfold cross-validation in
the case of Android 4.0.4 could be that the faulty class percentage in Android 4.0.2 is much
lower (5.47%) than the faulty class percentage in Android 4.0.4 (33.01%).
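A minimal sketch of the across-release validation procedure described above is given below: train on one release and evaluate AUC on the next. The release DataFrames, the "faulty" column name, and the Gaussian naive Bayes learner are assumptions for illustration, not the study's exact experimental setup.

from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

def across_release_auc(releases):
    # releases: ordered list of (name, DataFrame) pairs; each DataFrame holds the
    # OO metrics plus a binary 'faulty' column (assumed names for illustration).
    results = {}
    for (train_name, train), (test_name, test) in zip(releases, releases[1:]):
        X_train, y_train = train.drop(columns="faulty"), train["faulty"]
        X_test, y_test = test.drop(columns="faulty"), test["faulty"]
        model = GaussianNB().fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]   # probability of the faulty class
        results[f"{train_name} on {test_name}"] = roc_auc_score(y_test, scores)
    return results

Each entry of the returned dictionary corresponds to one column of Table 7.31 (e.g., "2.3.7 on 4.0.2").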
Exercises
  7.1 Briefly outline the steps of model prediction.
  7.2 What is multicollinearity? How can it be removed?
  7.3 What is ML? Define the various categories of ML techniques.
  7.4 Discuss the guidelines for selecting ML techniques.
  7.5 It is difficult to assess the accuracy of a model where most of the outcomes are nega-
      tives. In such cases, what criteria will you use to determine the accuracy of the model?
  7.6 Consider two models predicted using tenfold cross-validation. The error rates pro-
      duced by model1 are 32, 15, 14, 20, 35, 45, 48, 52, 27, and 29, and those produced by
      model2 are 20, 14, 10, 8, 15, 20, 25, 17, 19, and 7. We want to determine whether one
      model's performance is significantly better than the other's at the 0.01 significance
      level. Apply an appropriate statistical test and interpret the results.
  7.7 How can bias and variance be reduced for a given model?
  7.8 What is the difference between underfitting and overfitting?
  7.9 Which measures are useful for evaluating model performance when the data is
      imbalanced?
  7.10 How will a researcher decide on the selection of a learning technique?
  7.11 Consider a model with the following predicted values. Given the actual values,
      comment on the performance of the model.
                        Actual       Predicted    Actual    Predicted
                           0            0.34         1         0.34
                           1            0.78         1         0.82
                           0            0.23         0         0.21
                           0            0.46         1         0.56
                           0            0.52         0         0.61
                           1            0.86         0         0.21
                           1            0.92         1         0.76
                           1            0.68         1         0.56
                           0            0.87         0         0.10
Further Readings
The statistical methods and concepts are effectively addressed in:
This book emphasizes problem-solving strategies that address the many issues arising
when developing multivariable models using real data with examples:
The following book helps to explore the benefits in data mining that DTs offer:
  L. Rokach, and O. Maimon, Data Mining with Decision Trees: Theory and Applications,
     Series in Machine Perception and Artificial Intelligence, World Scientific,
     Singapore, 2007.
The detail about the classification and regression trees is presented in:
For a detailed account of the statistics needed for model prediction using LR (notably how
to compute maximum likelihood estimates, R2, significance values), see the following text
book and research paper:
This research paper by Webb and Zheng presents the framework for multistrategy ensem-
ble learning techniques:
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
In this book, the authors explain the basic concepts of NN and then show how these mod-
els can be applied to applications:
  D. F. Specht, “Probabilistic neural networks,” Neural Networks, vol. 3, no. 1, pp. 109–118,
     1990.
The authors present the basic ideas of SVM together with the latest developments and cur-
rent research questions in:
  I. Steinwart, and A. Christmann, Support Vector Machines, Springer Science & Business
      Media, New York, 2008.
The concept of search-based techniques along with its practical applications and guide-
lines in software engineering are effectively presented by Harman et al. in:
  M. Harman, “The relationship between search based software engineering and pre-
    dictive modeling,” Proceeding of 6th International Conference on Predictive Models in
    Software Engineering, Timisoara, Romania, 2010.
  M. Harman, “The current state and future of search based software engineering,” In:
    Proceedings of Future of Software Engineering, Washington, pp. 342–357, 2007.
  M. Harman, S. A. Mansouri, and Y. Zhang, “Search based software engineering:
    A comprehensive analysis and review of trends techniques and applications,”
    Technical Report TR-09-03, Department of Computer Science, King’s College
    London, London, 2009.
  M. Harman, P. McMinn, J. T. D. Souza, and S. Yoo, “Search based software engi-
    neering: Techniques, taxonomy, tutorial,” In: Empirical Software Engineering and
    Verification, Springer, Berlin, Germany, pp. 1–59, 2012.
  M. Harman, and P. McMinn, “A theoretical and empirical study of search-based test-
    ing: Local, global, and hybrid search,” IEEE Transactions on Software Engineering,
    vol. 36, no. 2, pp. 226–247, 2010.
  Y. Li, “Selective voting for perceptron-like online learning,” In: Proceedings of the 17th
     International Conference on Machine Learning, San Francisco, pp. 559–566, 2000.
This research paper provides knowledge about detecting multicollinearity and also dis-
cusses how to deal with the associated problems:
The following research paper introduces and explains methods for comparing the perfor-
mance of classification algorithms:
Yang addresses general issues in categorical data analysis with some advanced meth-
ods in:
Davis and Goadrich present the deep connection between ROC space and PR space in the
following paper:
This research paper presents the most powerful nonparametric statistical tests to carry out
multiple comparisons using accuracy and interpretability:
This research paper serves as an introduction to ROC graphs and as a guide for using
them in research:
  T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27,
     no. 8, pp. 861–874, 2006.
The root-mean-square error (RMSE) and the MAE to describe average model performance
error are examined in:
  C. J. Willmott, and K. Matsuura, “Advantages of the mean absolute error (MAE) over
     the root mean square error (RMSE) in assessing average model performance,”
     Climate Research, vol. 30, no. 1, p. 79, 2005.
The validity of the results is an important concern for any empirical study. The results
of any empirical research must be valid for the population from which the samples are
drawn. The samples are drawn from, and the results generalized to, the population that
the researcher decides upon. Threats reduce the applicability of the research results. Hence,
the researcher must identify the extent of the validity of the results at the design stage and
must also provide a list of threats to the validity of the results after the results have been
analyzed. This will provide the readers with complete information about the limitations of
the study.
  In this chapter, we present the categories of threats to validity, explain the threats with
examples, and also list the various threats identified from fault prediction studies. We also
provide possible mitigation strategies for these threats.
[Diagram: threats are categorized into conclusion, internal, construct, and external validity.]
FIGURE 8.1
Categorization of threats.
the chosen statistical test (such as the size of the sample, normality of the data, etc.) should
be fulfilled. Researchers should apply nonparametric tests in cases where they are not
completely sure that their data fulfills all the conditions of a parametric statistical test, in
order to eliminate threats to conclusion validity (a short illustrative sketch follows the list
below). It may be noted that some researchers refer to this threat as “statistical conclusion
validity.” The various possible threats to conclusion validity are as follows (Cook and
Campbell 1979; Wohlin et al. 2012):
  • Use of immature subjects: The experimental data may be collected from groups that
    are not true representatives of industrial settings. For example, if the results
    are based on samples collected from software developed by undergraduate
    students, this may pose a serious threat to conclusion validity.
  • Reliability of techniques applied: The settings of the algorithms or techniques should
    be standard and should not be overtuned to the data, as this may produce overfitted
    results.
  • Heterogeneity of validation samples or case studies: The samples should not be
    heterogeneous, as the variation in the results will then be influenced more by the envi-
    ronment and nature of the samples than by the techniques applied. However, this
    will pose a threat to generalizability and hence decrease the external validity of the results.
  • Lack of expert evaluation: The conclusions or results should be evaluated by an
    expert to understand and interpret their true meaning and significance. Lack of
    expert judgment may lead to erroneous conclusions.
  • Variety of data preprocessing or engineering activities not taken into account: An
    experiment involves a wide range of data preprocessing or other activities such as
    scaling, discretization, and so on. These activities can significantly influence the
    results if not properly taken into account.
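To make the assumption check mentioned above concrete, the following minimal sketch (in Python, using SciPy) tests the paired differences of two models' error values for normality and falls back to a nonparametric test when normality is doubtful. The error vectors a and b are hypothetical fold-wise error rates; this is an illustration of the recommended practice, not a prescribed procedure.

from scipy import stats

def compare_models(a, b, alpha=0.05):
    # a, b: paired error values of two models (e.g., per-fold error rates).
    diff = [x - y for x, y in zip(a, b)]
    _, p_normal = stats.shapiro(diff)            # Shapiro-Wilk normality check
    if p_normal > alpha:                         # differences look normal: parametric test
        statistic, p_value = stats.ttest_rel(a, b)
        test_used = "paired t-test"
    else:                                        # otherwise fall back to a nonparametric test
        statistic, p_value = stats.wilcoxon(a, b)
        test_used = "Wilcoxon signed-rank test"
    return test_used, statistic, p_value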
Internal validity is the degree to which we can confidently conclude that the changes in the
dependent variable B are caused only by the independent variable A.
  Internal validity concerns itself with all the possible factors except the independent
variables of the study, which can cause the observed outcome (Neto and Conte 2013).
Apart from the independent variables of a study, there could be other (“confound-
ing”) factors, which cannot be controlled by the researcher. Such extraneous variables
are confounding variables, which may be correlated with the independent and/or the
dependent variable. Therefore, such variables pose a threat to the results of the study
as they could be responsible for the “causal” effect of the independent variables on the
dependent variable (Wohlin et  al. 2012). For example, a researcher who would like to
investigate the relationship between object-oriented (OO) metrics and the probability
that a particular class would be faulty or not should consider the confounding effect
of class size on these associations. This is important, as a researcher cannot control the
size of a class. Moreover, class size is correlated with the probability of faults, as larger
classes tend to have a larger number of faults because of their size. Thus, size may affect
the actual relationship between other OO metrics, such as coupling and cohesion, and
could be responsible for their “causal” relationship with the fault-prone nature of a class.
Internal validity thus accounts for factors that are not controlled by the researcher and
may falsely contribute to a relationship. The threats related to experimental settings are
also part of internal validity. The various possible internal validity threats that commonly
occur in empirical studies are as follows:
intend to measure. Hence, this validity poses a threat if the researcher has not accurately
and correctly represented the variables (independent and dependent) of the study.
For  example, the coupling attribute of a class (theoretical concept) in an OO software
may be represented by a measure that correctly and precisely counts the number of other
classes to which a particular class is interrelated. Similarly, a researcher who wishes to
investigate the relationship between OO metrics and fault-prone nature of a class should
primarily evaluate that all the selected metrics of the study are correct and effective
indicators of concepts like coupling, cohesion, and so on. The bugs collected from a bug
repository can represent the theoretical concept “fault.” However, this bug data should be
collected carefully and exhaustively, with correct mapping, to avoid any biased
representation of the faults in the classes. If the bugs are not properly collected, it may lead to
an incorrect dependent variable. Thus, both the independent and the dependent variables
should be carefully verified for use in experiments to eliminate the threats to construct
validity. The various possible threats to construct validity are as follows:
of results in those situations that are different from the subjects and settings of the study.
For example, a study which evaluates the effectiveness of machine learning methods to iden-
tify fault-prone classes using data mined from open source software would have external
validity concerns regarding whether the results computed using machine learning algo-
rithms on a given data set hold true for industrial software data sets or for other open-source
data sets. A study with high external validity is favorable as its conclusions can be broadly
applied in different scenarios, which are valid across the study domain (Wright et al. 2010).
Thus, the results obtained from a study that uses a larger number of data sets of different
sizes and nature, and that recomputes results using varied algorithms, will help in establishing
well-formed theories and generalized results. Such results will form widely acceptable and
well-founded conclusions. The various possible threats to external validity are as follows:
The threats to external validity can be reduced by clearly describing the experimental
settings and techniques used. The study must be carried out with the intent of enabling
other researchers to repeat and replicate it.
  • The study uses public domain NASA data set KC1. Thus, the data set is verified
    and trustworthy, and does not contain erroneous observations as it is developed
    following best industrial practices in NASA.
  • The study uses well-formed hypotheses to ascertain the relationship between OO
    metrics and fault proneness. Moreover, the statistical significance levels
    (0.01 and 0.05) used during correlation analysis as well as univariate and multi-
    variate analysis increase the confidence in the conclusions of the study.
  • The study uses tenfold cross-validation results, a widely accepted method in
    research (Pai and Bechta Dugan 2007; De Carvalho et al. 2010) for yielding con-
    clusive results, thus reducing the threat to conclusion validity (a short sketch
    follows this list).
  • A data set is said to be imbalanced if the class distribution of faulty and nonfaulty
    classes is nonuniform. A number of literature studies (Lessmann et al. 2008;
    Menzies et al. 2010) advocate the use of receiver operating characteristic (ROC)
    analysis as a competent measure for assessing imbalanced data sets. Thus, the use
    of ROC analysis avoids threats to conclusion validity.
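A minimal sketch of tenfold cross-validation scored by AUC is shown below. The stratified folds, the Gaussian naive Bayes learner, and the names X and y (metrics and binary fault labels) are assumptions for illustration, not the example study's exact configuration.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def tenfold_auc(X, y):
    # Stratified tenfold cross-validation: each fold is scored by AUC and the
    # fold scores are averaged into a single performance estimate.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
    return scores.mean()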
TABLE 8.1
Details of Example Study
Data used            Description                     NASA data set KC1 (public domain)
                     Size                            145 classes, 40K lines of code
                     Language                        C++
                     Distribution                    Faulty classes: 59
                                                     Nonfaulty classes: 86
                     Descriptive statistics stated   Min, max, mean, median, standard deviation, 25%
                                                      quartile, and 75% quartile for each input metric
Independent          OO metrics                      Chidamber and Kemerer metrics and LOC
 variables
Dependent variable   Fault proneness                 Faults categorized into three severity levels: high,
                                                      medium, and low. A model was also created with
                                                      ungraded fault severity
                     Distribution according to       High severity         23 classes
                      fault severity                 Medium severity 58 classes
                                                     Low severity          39 classes
Preprocessing        Outlier detection               Detected univariate and multivariate outliers (using
 performed                                            Mahalanobis Jackknife distance)
                     Input metrics normalization     Using min–max normalization
                     Correlation analysis            Correlation coefficient values among different metrics
                                                      analyzed. Significance level: 0.01
                     Multicollinearity analysis      Conditional number using principal component
                                                      method is <30
Algorithms used      LR                              Univariate LR
                                                     Multivariate LR
                     Machine learning                DT
                                                     ANN
Algorithm settings   DT                              Chi-square automatic interaction detection (CHAID)
                                                      algorithm
                     ANN                             Architecture          3 layers
                                                                           7 input units
                                                                           15 hidden units
                                                                           1 output unit
                                                     Training              Tansig transfer function
                                                                           Back propagation algorithm
                                                                           TrainBR function
                                                                           Learning rate 0.005
Model evaluation     Performance metrics             Sensitivity
                                                     Specificity
                                                     Completeness
                                                     Precision
                                                     ROC analysis
                     Statistics reported for         Coefficient (B), standard error (SE), statistical
                      univariate LR and               significance (Sig.), odds ratio (Exp [B]), R2 statistic.
                      multivariate LR analyses        Significance level: 0.01 and 0.05
Model development    Feature reduction               Univariate analysis
                     Validation method               Tenfold cross-validation
  • Outliers are unusual data points that may pose a threat to the conclusions of the
    study by producing bias. The study reduces this threat by performing outlier
    detection using univariate and multivariate analysis.
  • The researchers do not have control over class size, and thus class size can act as a
    confounding variable in the relationship between OO metrics and fault proneness
    of a class. However, the study included the lines of code (LOC) metric (a measure
    of class size) as an independent variable in the analysis. Evaluating the confounding
    effect of class size, however, was beyond the scope of the study. Thus, this threat to
    internal validity exists in the study.
  • The study also examined correlation among the different metrics, and it is seen that
    some independent variables are correlated among themselves. However, this
    threat to internal validity was reduced by performing multicollinearity analysis,
    where the condition number was found to be <30, indicating that the predicted
    models can be interpreted effectively because the individual effect of each independent
    variable can be assessed (see the sketch following this list).
  • A number of studies, which evaluate fault proneness of a class, do not take into
    account the severity of the faults. This is a possible threat to internal validity.
    However, this study accounts for three severity levels of faults.
  • The study does not take into account and control the effect of programmer’s capa-
    bility/training and experience in model prediction at various severity levels of
    faults. Thus, this threat exists in the study.
  • The association of defects with each class according to their severity was done
    very carefully to provide an accurate representation of fault-prone nature and fault
    severity. Moreover, the faults were divided into three severity levels: high, medium,
    and low, so that medium-severity level of faults can be given more attention and
    resources than a low-level fault. An earlier study by Zhou and Leung (2006) divided
    faults only into two severity levels: high and low. They combined both medium-
    and low-severity faults in the low category. This was a possible threat to construct
    validity, as medium-severity faults are more critical and should be prioritized over
    low-severity faults. However, this threat was removed in this study.
  • The metrics used in the study are widely used and established metrics in the literature.
    Thus, they accurately represent the concepts they propose to measure. Moreover, the
    selected metrics are representative of all OO concepts like depth of inheritance tree
    (DIT) and number of children (NOC) metrics for inheritance, lack of cohesion in
    methods (LCOM) metric for cohesion, coupling between object (CBO) metric for cou-
    pling, weighted methods per class (WMC) metric for complexity, and response for
    a class (RFC) and LOC metrics for size. Thus, the selected metric suite reduces the
    threat to construct validity by accurately and properly representing all OO concepts.
  • The mapping of faults to their corresponding classes is done carefully. However,
    there could be an error in this mapping, which poses a threat to construct validity.
  • The metrics and severity of faults for NASA data set KC1 are publically available.
    However, we are not aware as to how they were calculated. Thus, the accuracy of
    the metrics and severity levels of faults cannot be confirmed. This is a possible
    threat to construct validity.
  • The data set used is publically available KC1 data from NASA metrics data pro-
    gram. Since the data set is publically available, repeated and replicated studies are
    easy to perform increasing the generalizability of results. As discussed by Menzies
    et al. (2007), NASA uses contractors that are obliged by contract (ISO-9001) to dem-
    onstrate the understanding and use of current best industrial practices.
  • The results of the study are limited to the investigated complexity metrics (Chidamber
    and Kemerer [CK] metrics and LOC) and modeling techniques (logistic regression
    [LR], decision tree [DT], and artificial neural network [ANN]). However, the selected
    metrics and techniques are widely used in literature and well established. Thus, the
    choice of such metrics and techniques does not limit the generalizability of the results.
  • Fault severity rating in the KC1 data set may be subjective and thus may limit the
    generalizability of the study results.
  • Data sets developed using other programming languages (e.g., Java) have not been
    explored. Thus, replicated and repeated studies with different data sets are impor-
    tant to establish widely acceptable results.
  • The conclusions of the study are only specific to fault-proneness attribute of a class
    and the results of the study do not claim anything about the maintainability or
    effort attributes.
  • The researchers have completely specified the parameter setting for each algorithm
    used in the study. This increases the generalizability of the results as researchers
    can easily perform replicated studies.
  • The study uses the tenfold cross-validation technique, which uses ten iterations (the
    whole data set is partitioned into ten subsets; each iteration uses nine partitions for
    training and the tenth partition for validating the model, and this process is repeated
    ten times). Thus, the use of tenfold cross-validation increases the generalizability of
    the results.
  • The study states the descriptive statistics of the data set used in the study. These
    descriptive statistics give other researchers an insight into the properties of the data
    set. Researchers can thus effectively apply the results of the study to similar types
    of data sets.
  • There is only one data set used in the study. This poses a threat to the generaliz-
    ability of results. However, the data set used is an industrial data set developed by
    experienced developers. Thus, the results obtained may be applied for software
    industrial practices.
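The multicollinearity check mentioned in the list above can be illustrated with a minimal sketch. It assumes the OO metrics are available in a pandas DataFrame and computes the condition number from the eigenvalues of the correlation matrix (which is what a principal component based check amounts to); the <30 threshold follows the example study discussed here.

import numpy as np
import pandas as pd

def condition_number(X: pd.DataFrame) -> float:
    # Eigenvalues of the correlation matrix are the principal component variances;
    # the condition number is sqrt(largest eigenvalue / smallest eigenvalue).
    eigenvalues = np.linalg.eigvalsh(X.corr().to_numpy())
    return float(np.sqrt(eigenvalues.max() / eigenvalues.min()))

# A value below 30 is commonly taken to indicate acceptable multicollinearity,
# which is the threshold used in the example study above.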
TABLE 8.2
Fault Prediction Studies
Study                               Study                             Study
No.            Reference             No.          Reference            No.         Reference
S1      El Emam et al. 1999          S20    Aggarwal et al. 2009       S40    Al Dallal 2012a
S2      Briand et al. 2000           S21    Catal and Deri 2009        S41    Al Dallal 2012b
S3      Glasberg et al. 2000         S22    Tosun et al. 2009          S42    Nair and Selverani 2012
S4      Briand et al. 2001           S23    Turhan and Bener 2009      S43    He et al. 2012
S5      El Emam et al. 2001          S24    Turhan et al. 2009         S44    Li et al. 2012
S6      El Emam et al. 2001a         S25    Zimmermann et al. 2009     S45    Ma et al. 2012
S7      Subramanyam and              S26    Afzal 2010                 S46    Mausa et al. 2012
         Krishnan 2003               S27    Ambros et al. 2010         S47    Okutan Vildiz 2012
S8      Zhou and Leung 2006          S28    Arisholm et al. 2010       S48    Pelayo and Dick 2012
S9      Aggarwal et al. 2007         S29    De Carvalho et al. 2010    S49    Rahman et al. 2012
S10     Kanmani et al. 2007          S30    Liu et al. 2010            S50    Rodriguez et al. 2012
S11     Menzies et al. 2007          S31    Menzies et al. 2010        S51    Canfora et al. 2013
S12     Oral and Bener 2007          S32    Singh et al. 2010          S52    Chen et al. 2013
S13     Pai and Bechta Dugan 2007    S33    Zhou et al. 2010          S53    Herbold 2013
S14     Lian Shatnawi et al. 2007    S34    Al Dallal 2011             S54    Menzies et al. 2013
S15     Lessmann et al. 2008         S35    Elish et al. 2011          S55    Nam et al. 2013
S16     Marcus et al. 2008           S36    Kpodjedo et al. 2011       S56    Peters et al. 2013
S17     Moser et al. 2008            S37    De Martino et al. 2011
S18     Shatnawi and Li 2008         S38    Misirh et al. 2011
S19     Turhan et al. 2008           S39    Ambros et al. 2012
TABLE 8.3
Conclusion Validity Threats
Threat Type          Threat Description                                       Threat Mitigation                                                   Studies that Encounter These Threats
Assumptions of       The data on which a particular statistical test or measure   A researcher should ensure that the conditions of a parametric             C: S26; S27; S28;
 statistical tests    is applied may not be appropriate for fulfilling the         statistical test are fulfilled or should use a nonparametric test          S29; S36; S38;
 not satisfied        conditions of the test. For example, application of          where the conditions necessary for using parametric tests do not           S39; S45; S48;
                      analysis of variance (ANOVA) test requires the               hold.                                                                      S51.
                      assumption of data normality and homogeneity of
                      variance, which may not be fulfilled by the underlying
                      data leading to erroneous conclusions.
Low statistical      Use of inappropriate significance level while                The researcher should choose an appropriate significance level for         C: S26; S27.
 test ability         conducting statistical tests. Choice of an inadequate        statistical tests such as 0.01 or 0.05 to conclude significant results.
                      significance level will lead to incorrect conclusions.
Validation bias      The predicted models may use the same data for               Use of cross-validation and multiple iterations to avoid sampling          C: S13; S15; S20;
                      training as well as validation leading to the possibility    bias. Also, to yield unbiased results, the training data should be         S21; S26; S29;
                      of biased results. May be termed as sampling bias.           different from testing data.                                               S38; S46; S47.
Inappropriate use    Inaccurate use of performance measures.                      The performance measure used for analyzing the results should              C: S21; S46.
 of performance                                                                    be appropriate to evaluate the studied data sets. For example, an
 measures                                                                          unbalanced data set should be evaluated using Area Under the
                                                                                   ROC curve (AUC) performance measure.
                     Inaccurate cutoffs for performance measure. For              Appropriate cutoff values for performance measure should be                N: S53.
                      example, a recall value of greater than 70% may be           justified and should be based on previous empirical studies and
                      considered appropriate for an effective model.               research experience.
Reliability of       Parameter tuning of algorithms was not done                  Tuning parameters of a particular technique improves the results.          C: S43.
 techniques           appropriately. The parameters of a specific algorithm        However, on the other hand, use of default parameters avoids
 applied              may be overtuned or undertuned leading to biased             overfitting. Thus, the parameters of algorithms should be
                      results.                                                     appropriately tuned.                                                      N: S28; S30; S50.
Lack of expert       An expert should evaluate the generated classification   The generated classification rules should be evaluated by an       N: S29.
 evaluation           rules to understand their true significance or meaning.  expert to understand their true significance or meaning.
Wide range of data   A number of preprocessing or engineering activities          Different preprocessing techniques or engineering activities like          P: S15.
 preprocessing or     such as scaling, feature selection, and so on has not        scaling of data values, feature selection, and so on may improve
 engineering          been taken into account for their effect on results and      the performance of the classifier and should be accounted for in
 activities not       conclusions.                                                 the study. However, doing so is computationally infeasible.
 taken into                                                                        Thus, the selection of techniques and preprocessing activities
 account                                                                           should be such that the conclusions do not vary significantly.
TABLE 8.4
Internal Validity Threats
                                                                                                                                                  Studies that
                                                                                                                                                   Encounter
Threat Type                           Threat Description                                             Threat Mitigation                           These Threats
Confounding        Does not account for the confounding effect of class size on    The study should account for the confounding effect of        C: S6.
 effects of         the association between metrics and fault proneness.            class size on the relationship between metrics and fault
 variables                                                                          proneness by controlling class size and its effect on the    P: S34; S40; S41.
                                                                                    relationship.
                   Does not account for causality of association between           Controlled experiments where a specific measure such as       N: S1; S2; S4; S6;
                    specific metrics and dependent variable (fault proneness).      cohesion or coupling can be varied in a controlled            S10.
                                                                                    manner while keeping all other factors constant can
                                                                                    demonstrate causality. Such experiments are difficult to
                                                                                    perform in practice.
                   Does not account for noise in the data collected from a         As noise in a repository such as Bugzilla could affect the    N: S27.
                    specific repository. The noisy data can significantly affect    relationship between bugs and metrics, it should be
                    the relationship between metrics and defects.                   completely removed or the effect of noise should be
                                                                                    accounted for in the relationship between faults (bugs)
                                                                                                                                                 P: S39.
                                                                                    and metrics.
Response of        Certain techniques produce different results at different       Should execute multiple runs (>10 runs) of a random           N: S14; S18; S33;
 samples for a      times because of randomness of a particular technique.          algorithm to obtain unbiased results.                         S34 S41.
 given technique    For example, an algorithm like genetic algorithm
                    randomly selects its initial population. Thus, a single run
                    of a random algorithm may produce biased results.
Influence of       Does not account for programmer’s capability/training           If possible, the study should evaluate the effect of          N: S6; S7; S12; S42;
 human factors      and other human factors on the explored relationship of         programmer’s capability/training and other human              S45.
                    metrics and fault proneness.                                    factors on the cause–effect relationship of metrics and
                                                                                    fault proneness. However, data to evaluate such a
                                                                                    relationship is difficult to collect in practice.
                   Confounding effect of developer experience/domain               If possible, the study should evaluate the effect of          N: S3; S8; S12.
                    knowledge on the cause–effect relationship between              developer experience/domain knowledge on the
                    metrics and faults has not been accounted for.                  cause–effect relationship between metrics and faults.
                                                                                    However, data to evaluate such a relationship is difficult
                                                                                    to collect in practice.                                      P: S42.
TABLE 8.4 (Continued)
Misinterpretation of   Different OO metrics used as independent variables          The measures capturing the various attributes should be       N: S17.
 concepts and           may not be well understood or represented.                  well understood.                                             P: S2; S4.
 measures                                                                                                                                        C: S13; S21.
                       Inappropriate use of performance metrics for                The performance metrics used for evaluating the results       N: S28; S51.
                        representing the theoretical concepts or improper           should be carefully chosen and should minimize the gap       C: S26, S37.
                        computation of performance measure.                         between theory and experimental concepts. Also, the com-
                                                                                    puted performance measure should be verified by two
                                                                                     independent researchers so that the results are not biased.
Unaccountability of    To compare different algorithms, different settings, and    To conduct a fair comparison among algorithms, a              N: S50.
 generalizability       internal details such as different number of rules and      baseline should be established to evaluate the number
 across related         different quality measures are not accounted for. This      of rules, choose an appropriate quality measure, and so
 constructs             leads to unfair comparison among different                  on. Only use of default parameters does not take into
                        algorithms.                                                 account better sets of rules.
                       Combining results with different metrics (OO and            Use sensitivity analysis to check whether the results are     C: S30; S48.
                        procedural) for multiple data set combinations may          biased.
                        produce inappropriate or biased results.
Reliability of         Use of an automated tool for extraction of metrics          Manual verification of the metrics/dependent variable         N: S18; S19.
 measurement tools      (independent variables) or dependent variable.              generated by the tool.                                       P: S17.
                                                                                                                                                 C: S4; S19; S25; S36;
                                                                                                                                                  S41; S47.
Improper data-         The process or data-collection methods for collection       Researchers should specifically verify the collection         N: S44.
 collection methods     of dependent variable may not be completely verified        process of the dependent variable. For example,
                        leading to improper data collection.                        acceptance testing activities for collection of faults.      C: S2; S13.
                       Mistakes in identification/assignment/classification        Manual verification of faults can minimize this threat.       N: S43; S53.
                        of faults.                                                                                                               C: S36; S52.
                       Does not account for measurement accuracy of the            A researcher should use appropriate data-collection           C: S15; S17; S30;
                        software attributes such as coupling, faults, and so on.    methods and verify them. Public data sets whose               S37; S45; S48; S49.
                                                                                    characteristics have already been validated in previous      P: S44; S51.
                                                                                    research studies may also be used to instill confidence
                                                                                    in software attributes data collection.
TABLE 8.5 (Continued)
Inappropriate       Investigated software systems may not be             The size and number of classes in the evaluated data set      N: S34; S40; S41; S42.
 selection of        representative in terms of their size and number     may not be representative of industrial systems. Thus,
 subjects            to industrial scenario.                              researchers should be careful while selecting data sets so
                                                                          that their results might be generalizable to industrial
                                                                          settings.
                     In the software industry, personnel may not be        The selected software systems should have undergone           C: S28.
                      available for all software development phases.        large organizational and personnel changes to
                      Industrial software undergoes large                   generalize results to real industrial settings.
                      organizational and personnel changes. It is
                      possible that the selected software systems may
                      not have undergone large personnel and
                      organizational changes as in an industrial scenario.
Applicability of    Use of monoprogramming language software             The software systems selected for evaluation should be        N: S1; S2; S5; S8; S16;
 results across      systems.                                             developed in different programming languages to yield         S19; S27; S32; S33; S34;
 languages                                                                generalized results across languages.                         S39; S40; S41; S49; S53;
                                                                                                                                        S56.
                                                                                                                                       C: S30.
Inadequate size     The number of software systems evaluated is low      A large number of data sets should be evaluated to assure     N: S1; S4; S5; S7; S8; S13;
 and number of       affecting the generalizability of the results.       universal application of results.                             S26; S35; S46; S53.
 samples                                                                                                                               P: S16; S19.
                                                                                                                                       C: S25; S47; S51; S56.
                    The size of evaluated software systems may not       The evaluated data sets should be industry sized and/or       N: S2; S4; S8; S9; S10;
                     be appropriate for industrial settings.              should be of varied sizes (small/medium/large).               S20; S29.
                                                                                                                                       P: S35.
                                                                                                                                       C: S21; S25; S28; S30; S36.
                    The software systems evaluated may be extracted      The evaluated software systems should be collected from       C: S27; S39.
                     from a single versioning system/bug tracking         different versioning systems/bug repositories to obtain
                     system. Thus, application of results to other        generalized results.
                     versioning/bug tracking systems is limited.
TABLE 8.6 (Continued)
Results bias           Model bias because of selection of specific           A number of learners should be evaluated to increase the          N: S46.
 because of             learners.                                             generalizability of results.                                     P: S11; S12; S13; S17; S24;
 techniques or                                                                                                                                  S33; S51; S56.
 subjective                                                                                                                                    C: S15.
 classification of
 a variable            Classification of faults into severity levels could   Researchers should clearly specify the criteria for labeling      N: S8; S14; S18; S32.
                        be subjective.                                        faults into different severity levels so that studies could be
                                                                              appropriately replicated. Inappropriate classification of
                                                                              faults may lead to biased results.
Nonspecification of    Nonspecification of data characteristics and          The data sets should be publically available for repeated         C: S11; S15; S30; S37;
 experimental           details for repeated studies.                         and replicated studies.                                           S44; S46.
 setting and
 relevant details
categorized into three divisions, namely, N (the threat is not addressed at all), P (the threat
is partially addressed), and C (the threat is completely addressed). Thus, all the studies that
encounter a particular threat and do not take any action to mitigate its effect are catego-
rized in the N category. Similarly, all the studies that try to address a specific threat but have
not been successful in completely removing its effect are categorized as P. Finally, all the
studies that take appropriate actions to mitigate the threat as effectively as possible are
grouped into the C category.
Exercises
  8.1 Identify the categories to which the following threats belong:
      • Threat caused by not taking into account the effect of developer experience on
         the relationship between software metrics and fault proneness.
      • Threat caused by only exploring systems developed using Java language.
      • Threat caused by using the same data for testing and training.
      • Threat caused by investigating a not publically available data set.
      • Threat caused by exploring only open source systems.
      • Threat caused by considering inappropriate level of significance.
      • Threat caused by incomplete or imprecise data sets.
  8.2 Consider a study where the lines of code metric is mapped to various levels of com-
      plexity, such as high, medium, and low. What kinds of threats will this mapping
      impose?
  8.3 Consider a systematic review in which only journal papers are included. The
      review also uses an inclusion and exclusion protocol based on subjective judgment
      to select the papers to be included. Identify the potential threats to
      validity.
  8.4 Compare and contrast conclusion and external threats to validity.
  8.5 What are validity threats? Why is it important to consider and report threats to
      validity in an empirical study?
  8.6 Consider the systematic review given in Section 8.3, identify the threats of valid-
     ity that exist in this study.
Further Readings
The concept of threats to validity is presented in:
This following research paper address external validity and raises the bar for empirical
software engineering research that analyzes software artifacts:
This following research paper provides a tradeoff between internal and external validity:
The impact of the assumptions made by an empirical study on the experimental design is
given in:
The goal of the research is not just to discover or analyze something but the results of the
study must be properly written in the form of research report or publication to enable
the results to be available to the intended audience—software engineers, researchers,
academicians, scientists, and sponsors. After the experiment has finished, the results
of the  experiment can be summarized for the intended audience. While reporting the
results, the research misconduct, especially plagiarism, must be taken care of.
  This chapter presents when and where to report the research results, provides guidelines
for reporting research, and summarizes the principles of research ethics and misconduct.
  • Allows presenting the methodology and the results to the outside world
  • Allows software engineering organizations to apply the findings of the research
    in the industrial environment
  • If the study is a systematic literature review, it will allow the researcher to have
    an idea of the current position of the research in the specific software engineer-
    ing area
                                                                                           353
354                                              Empirical Research in Software Engineering
   1. Type of research work: The work carried out by the research may be survey/review,
      original work, or case study. The survey or systematic reviews are mostly consid-
      ered by journals, as they mostly do not depict any new and innovative research
      or new findings. Hence, relevant journals may be considered for publishing them.
      The new or empirical work may be considered for publication in conference or
      journal depending on the status of the work.
   2. Status of the work: As discussed in previous section, the status of the work is an
      important criterion for selection of the place for publication of the research. The
      new or initial idea or research may be considered for publication in a conference
      to obtain the initial feedback about the work. The detailed findings of the research
      may be considered to be published in journals.
   3. Type of audience: The selection of publication venue also depends on the type
      of audience. The study can be considered for publication in journal, when the
      researcher wants to present the work to academic and research community.
      To make the work visible to software industry, the findings may be communicated
      to industry conferences, practitioners’ magazines, or journals. The work may also
      be presented to sponsors or funders in the form of technical reports.
Thus, the initial findings of the scholarly articles may be communicated to the conferences
and the detailed findings can be published as journal papers or a book chapter with atleast
30% of new material added. Some researchers also like to publish the results of the work
on their website or home page.
Reporting Results                                                                                            355
TABLE 9.1
Structure of Journal Paper
Title Author details with affiliation
Abstract                            What is the background of the study?
                                    What are the methods used in the study?
                                    What are the results and primary conclusions of the study?
Introduction
Motivation                         What is the purpose of the study? How does the proposed work relate to the
                                    previous work in the literature?
Research questions                 What issues are to be addressed in the study?
Problem statement                  What is the problem?
Study context                      What are the experimental factors in the study?
Related work                       How is the empirical study linked with literature?
Experimental design
Variables description              What are the variables involved in the study?
Hypothesis formation               What are the assumptions of the study?
Empirical data collection          How is the data collected in the study?
                                   What are the details of data being used in the study?
Data analysis techniques           What are the data analysis techniques being used in the study?
                                   What are the reasons for selecting the specified data analysis techniques?
Analysis process                   What are the steps to be followed during research analysis?
Research methodology
Analysis techniques                What are the details related with the selected techniques identified in
                                    experimental design?
Performance measures               How will the performance of the models developed in the study
                                    analyzed?
Validation methods                 Which validation methods will be used in the study?
Research results
Descriptive statistics             What is the summary statistics of data?
Attribute selection                Which attributes (independent variables) are relevant in the study?
Model prediction                   What is the accuracy of the predicted models? What are the model validation
                                    results?
Hypothesis testing                 What are the results of hypothesis testing?
                                   Is the hypothesis accepted or rejected?
                                                                                                      (Continued)
356                                                     Empirical Research in Software Engineering
9.1.3.1 Abstract
The abstract provides a short summary of the study, including the following components
of the research:
   1.   Background
   2.   Methods used
   3.   Major findings/key results
   4.   Conclusion
The length of the abstract should vary between 200 and 300 words. The abstract should not
provide long descriptions, abbreviations, figures, tables (or reference to figures and tables),
and references.
  The example abstract is shown below:
9.1.3.2 Introduction
This section must answer the questions such as: What is the purpose of the study? Why
the study is important? What is the context of the study? How the study enhances or adds
to the current literature? Hence, the introduction section must include motivation behind
the study, research aims or questions, and the problem statement. The motivation of the
work provides the information about the need of the study to the readers. The purpose of
the study is described in the form of research question, aim, or hypothesis. The relevant
primary studies from the literature (with citations) are provided in this section to provide
the summary of current work to the readers. The introduction section should also pres-
ent the brief details about the approach of the empirical research or study being carried
out. For example, this section should briefly state the techniques, data sets, and validation
methods used in the study.
  The introduction section should be organized in the following steps:
The extract from the introduction section describing the research aims is shown in Figure 9.1.
              In this work, the fault-prone classes are predicted using the object-oriented (OO)
              metrics design suite given by Chidamber and Kemerer (1994). The results are
              validated using latest version of Android data set containing 200 Java classes.
              Thus, this work addresses the following research questions:
                  RQ1: What is the overall predictive performance of the machine learning
                  techniques on open source software?
                  RQ2: Which is the best machine learning technique for finding effect of OO
                  metrics on fault prediction?
                  RQ3: Which machine learning techniques are significantly better than other
                  techniques for fault prediction?
FIGURE 9.1
Portion from introduction section.
358                                                         Empirical Research in Software Engineering
FIGURE 9.2
Example experimental process.
variable being used, data-collection procedure, and hypothesis for research. The size, nature,
description about subjects, and the source of data should be provided here. The tools
(if any) used to collect the data must also be provided in this section. The analysis process
to be followed while conducting the research is presented. For detailed procedure of exper-
imental design refer Chapter 4. The data-collection procedure is explained in Chapter 5.
For example, the experimental process depicted in Figure 9.2 is followed for conducting
research to find the answers to questions given in Figure 9.1.
9.1.3.9 Conclusions
This section presents the main findings and contributions of the empirical study. The future
directions are also presented in this section. It is important to focus on the commonalities
and differences of the study from previous studies.
9.1.3.10 Acknowledgment
The persons involved in the research that do not satisfy authorship criterion should be
acknowledged. These include funding or sponsoring agencies, data collectors, and reviewers.
9.1.3.11 References
References acknowledge the background work in the area and provide the readers a list
of existing work in the literature. The references are presented at the end of the paper and
should be cited in the text. There are software packages such as Mendeley available for
maintaining the references.
9.1.3.12 Appendix
The appendix section presents the raw data or any related material that can help the reader
or targeted audience to better understand the study.
   The claims of contributions, novelty in the work, and difference of the work from the
literature work are the main concerns that need to be addressed while writing a research
paper.
                                                                                                                                                                    Chapter 5—Interpretation
                           Motivation                                 Significant                            Data collection                        Presenting                                 Discussion of                            Summary of
Chapter 3—Research
                                                                                                                               Chapter 4—Research
  Chapter 1—Introduction
                                                                                                                                                                                                                Chapter 6—Conclusions
                           purpose                                    findings in the                        Hypothesis                             results                                    the results                              important
                                                                                           methodology
                           Organization of                            literature                             formulation                            Data                                       Generalization                           findings
                                                                                                                                     results
                           thesis                                     Gaps in the                            Method                                 preprocessing                              of the results                           Future work
                                                                      literature                             selection                                                                         Limitations of
                                                                                                                                                    Hypothesis
                                                                                                                                                    testing                                    the study
FIGURE 9.3
Format of thesis.
degree. It  ensures that the student is capable of carrying out independent research at a
later stage. The masters or doctoral work may involve conducting new research, carrying
an empirical study, development of new tool, exploring an area, development of new tech-
nique or method, and so on.
   The selection of right area and supervisor are the first and most important steps for
the masters and doctoral students. Each thesis has common structure, although different
topics. The general format of masters or doctoral thesis is presented in Figure 9.3, and the
description of each section is given below. The thesis begins with abstract that provides
a summary of the problem statement, purpose, data sources, research methods, and key
findings of the work.
          1. Chapter 1—Introduction
             The first chapter should clearly state the purpose, motivation, goals, and  sig-
             nificance of the study by describing how the work adds to the existing body
             of knowledge in the specified area. This chapter should also describe the prac-
             tical significance of the work to the researchers and software practitioners.
             The doctoral students should describe the original contribution of their work.
             This chapter provides the organization of the rest of the thesis. This part of the
             thesis is most critical and, hence, should be well written with strong theoretical
             background.
         2. Chapter 2—Literature Review
            This chapter should not merely provide the summary of the literature, but rather
            should analyze and discuss the findings in the literature. It must also describe
            what is not found in the literature. This chapter provides the basis of the research
            questions and hypothesis of the study.
         3. Chapter 3—Research Methodology
            The third chapter describes the research context. It describes the data-collection
            procedure, data analysis steps, and techniques description. The research settings
            and details of tools used are also provided.
         4. Chapter 4—Research Findings
            The results of the tests of model prediction and/or hypothesis testing are presented
            in this section. The positive as well as the negative results should be reported. The
            results are summarized and presented in the form of tables and figures, respec-
            tively. This chapter can be divided into logical subsections.
Reporting Results                                                                         361
The research misconduct must be carefully examined to assess the validity of the sus-
pected or reported incident. Plagiarism is a serious offence, and the issues involved in it
are described in next subsection.
362                                                         Empirical Research in Software Engineering
FIGURE 9.4
Guidelines to researchers regarding plagiarism.
9.3.1 Plagiarism
IEEE  defines plagiarism as “the reuse of someone else’s prior ideas, processes, results,
or words without explicitly acknowledging the original author and source.”  Plagiarism
involves paraphrasing (rearranging the original sentence) or copying someone else’s
words without acknowledging the source. Reusing or copying one’s own work is called
self-plagiarism. IEEE/ACM guidelines define serious actions against authors committing
plagiarism (ACM 2015). The institutions (universities and colleges) also define their indi-
vidual policies to deal with plagiarism issues. The faculty and students should be well
informed about the ethics and misconduct issues by providing them guidelines and poli-
cies, and imposing these guidelines on them. Figure 9.3 depicts guidelines for researchers
and practitioners to avoid plagiarism. The researchers, publishers, employers, and agen-
cies may use the plagiarism software for scanning the research paper. There are various
open source softwares such as Viper and proprietary software such as Turnitin available
for checking the documents for plagiarism.
  Finally, it is the primary responsibility of research institutions to ensure, monitor, detect,
and investigate research misconduct. Serious actions must be taken against individuals
caught with plagiarism or misconduct.
Exercises
   9.1 What is the importance of documenting the empirical study?
   9.2 What is the importance of related work in an empirical study?
   9.3 What is the importance of interpreting the results rather than simply stating
      them?
   9.4 When reporting a replicated study, what are the most important things that a
      researcher must keep in mind?
Reporting Results                                                                           363
Further Readings
An excellent study that provides guidelines on empirical research in software engineering
is given below:
  M. Shaw, “Writing good software engineering research papers,” Proceedings of the 25th
    International Conference on Software Engineering, IEEE Computer Society, Portland,
    OR, pp. 726–736, 2003.
Perry et  al. provided a summary of various phases of empirical studies in software
engineering in their article:
As seen in Chapter 5, software repositories can be mined to assess the data stored over
a long period of time. Most of the previous chapters focused on techniques that can be
applied on structured data. However, in addition to structured data, these repositories
contain large amount of data present in unstructured form such as the natural language
text in the form of bug reports, mailing list archives, requirements documents, source code
comments, and a number of identifier names. Manually analyzing such large amount of
data is very time consuming and practically impossible. Hence, text mining techniques are
required to facilitate the automated assessment of these documents.
  Mining unstructured data from software repositories allows analyzing the data related
to software development and improving the software evolutionary processes. Text min-
ing involves processing of thousands of words extracted from the textual descriptions. To
obtain the data in usable form, a series of preprocessing tasks like tokenization, removal of
stop words, and stemming must be applied to remove the irrelevant words from the docu-
ment. Thereafter, a suitable feature selection method is applied to reduce the size of initial
feature set leading to more accuracy and efficiency in classification.
  There are various artifacts produced during the software development life cycle.
The  numerical and structured data mined using text mining could be effectively used
to predict quality attributes such as fault severity. For example, consider the fault track-
ing systems of many open source software systems containing day-to-day fault-related
reports that can be used for making strategic decisions such as properly allocating
available testing resources. These repositories also contain unstructured data on vulner-
abilities, which records all faults that are encountered during a project’s life cycle. The
Mozilla Firefox is one such instance of open source software that maintains fault records
of vulnerabilities. While extensive testing can minimize these faults, it is not possible to
completely remove these faults. Hence, it is essential to classify faults according to their
severity and then address the faults that are more severe as compared to others. Text
mining can be used to mine relevant attributes from the unstructured fault descriptions
to predict fault severity.
  In this chapter, we define unstructured data, describe techniques for text mining, and
present an empirical study for predicting fault severity using fault description extracted
from bug reports.
10.1 Introduction
According to the researchers, it has been reported that nearly 80%–85% of the data is
unstructured in contrast to structured data like source code, execution traces, change logs,
and so on (Thomas et  al. 2014). The software library consists of fault descriptions, soft-
ware requirement documents, and other related documents. For example, the software
                                                                                          365
366                                                 Empirical Research in Software Engineering
available and the task is to associate each of the objects to any one of the categories.
There are multiple categories that a document may belong to, out of N categories avail-
able an object can belong to any of the categories. For example, in a requirement specifica-
tion document, the requirement may belong to various quality attributes such as security,
availability, usability, maintainability, and so on. There may be the case that a particular
object does not fit into any of the available N categories, or an object is suitable to be fit into
more than one category. Any kind of such combination is permissible in case of multiple
classifications. The need to perform N separate classification tasks is time consuming and
generally computationally expensive.
are included. Then, the classification is done for N categories. This approach works faster
but the accuracy can be less. In the local dictionary approach, a separate dictionary consid-
ering the terms only related to a given category is created. Thus, the dictionary is small but
the cost of computing N models to predict N categories is more than the global dictionary
approach (Bramer 2007).
FIGURE 10.1
Steps in text mining.
Mining Unstructured Data                                                                                   369
• Tokenization
• Stemming
FIGURE 10.2
Preprocessing techniques.
all the stop words like prepositions, articles, conjunctions, verbs, pronouns, nouns,
adjectives, and adverbs are removed from the document. Finally, the most important
step of preprocessing is performed, which is referred to as stemming. It is defined as the
process of removing the words that have the same stem, thereby retaining the stem as the
selected feature. For instance, words like “move,” “moves,” “moved,” and “moving” can
be replaced with a single word as “move.” After preprocessing steps, a set of features are
obtained by reducing the initial size of the feature space.
       Example 10.1:
       Consider an example of software requirements given in Table  10.1. This example
       demonstrates the description of the various software requirements, which describe the
       nonfunctional requirement (NFR; quality attributes). Generally, it has been observed
       that these requirements are not properly defined in the software requirement document
       and are scattered throughout the document in an ad hoc fashion. As we know, these
       qualities play an important role for the development and behavior of the software
TABLE 10.1
Original Data Consisting of Twelve Requirements and Their Description
Req. No.                                 Requirement Description                                     NFR Type
RQ1         the product shall be available during normal business hours. As long as the user has       1
             access to the client pc the system will be available 99% of the time during the first
             six months of operation.
RQ2         the product shall have a consistent color scheme and fonts.                                2
RQ3         the system shall be intuitive and self-explanatory.                                        3
RQ4         the user interface shall have standard menus buttons for navigation.                       2
RQ5         out of 1000 accesses to the system, the system is available 999 times.                     1
RQ6         the product shall be available for use 24 hours per day 365 days per year.                 1
RQ7         the look and feel of the system shall conform to the user interface standards of the       2
             smart device.
RQ8         the system shall be available for use between the hours of 8 am and 6 pm.                  1
RQ9         the system shall achieve 95% up time.                                                      1
RQ10        the product shall be easy for a realtor to learn.                                          3
RQ11        the system shall have a professional appearance.                                           2
RQ12        the system shall be used by realtors with no training.                                     3
370                                                       Empirical Research in Software Engineering
       system and, therefore, these qualities should be incorporated in the architectural design
       as early as possible, which is not the case. Hence, in this example, we intend to employ
       text mining techniques to mine these requirements and convert them into a structured
       form that can then be used for the prediction of the unknown requirements into their
       respective nonfunctional qualities. The data presents few requirements extracted from
       promise data repository. Here, we categorize the requirements into three different type
       of NFRs, namely, availability (A), look-and-feel (LF), and usability (U). These three
       NFRs have been labeled as type 1, 2, and 3, respectively. The original data in its raw
       form containing the description of the NFRs along with the type of NFR is given in
       Table 10.1.
10.2.2.1 Tokenization
The main aim of text mining is to extract all the relevant words in a given set of documents.
Tokenization is the first step in preprocessing. In tokenization, a document consisting of
various characters is divided into a well-defined collection of tokens. The process involves
removal of irrelevant numbers, punctuation marks, and replacement of special and non-
text characters with blank spaces. After removing all the unwanted characters, the entire
document is converted into lowercase. This tokenized representation forms the founda-
tion of extracting words for sentences. Table 10.2 represents the tokenized data obtained
after tokenizing the original data shown in Table 10.1.
 TABLE 10.2
 Tokenized Data Obtained after Tokenizing the Original Data
 Req. No.                                      Requirement Description
 RQ1         the product shall be available during normal business hours as long as the user has access to the
              client pc the system will be available of the time during the first six months of operation
 RQ2         the product shall have a consistent color scheme and fonts
 RQ3         the system shall be intuitive and self-explanatory
 RQ4         the user interface shall have standard menus buttons for navigation
 RQ5         out of accesses to the system the system is available times
 RQ6         the product shall be available for use hours per day days per year
 RQ7         the look and feel of the system shall conform to the user interface standards of the smart device
 RQ8         the system shall be available for use between the hours of 8 am and 6 pm
 RQ9         the system shall achieve up time
 RQ10        the product shall be easy for a realtor to learn
 RQ11        the system shall have a professional appearance
 RQ12        the system shall be used by realtors with no training
Mining Unstructured Data                                                                                 371
   TABLE 10.3
   Top-100 Stop Words
   a                     Ah                  Anybody                   aside                be
   able                  Aint                Anyhow                    ask                  became
   about                 all                 anymore                   asking               because
   above                 allow               anyone                    associated           become
   abst                  allows              anything                  at                   becomes
   accordance            almost              anyway                    auth                 becoming
   according             alone               anyways                   available            been
   accordingly           along               anywhere                  away                 before
   across                already             apart                     awfully              beforehand
   act                   also                apparently                back                 begin
   actually              although            appear                    beginning            better
   added                 always              appreciate                beginnings           between
   adj                   am                  appropriate               begins               beyond
   affected              among               approximately             behind               biol
   affecting             amongst             are                       being                both
   affects               an                  aren                      believe              brief
   after                 and                 arent                     below                briefly
   afterwards            announce            arise                     beside               but
   again                 another             around                    besides              by
   against               any                 as                        best                 cmon
    TABLE 10.4
    Data Obtained after Removing the Stop Words from the Tokenized Data
    Req. No.                                 Requirement Description
    RQ1          product normal business hours long user access client pc system time months operation
    RQ2          product consistent color scheme fonts
    RQ3          system intuitive explanatory
    RQ4          user interface standard menus buttons navigation
    RQ5          accesses system times
    RQ6          product hours day days year
    RQ7          feel system conform user interface standards smart device
    RQ8          system hours 8 am–6 pm
    RQ9          system achieve time
    RQ10         product easy realtor learn
    RQ11         system professional appearance
    RQ12         system realtors training
  Table 10.4 represents the data obtained after removing the stop words from the tokenized
data shown in Table 10.2.
      TABLE 10.5
      Data Obtained after Performing Stemming
      Req. No.                                 Requirement Description
      RQ1               product normal busi hour long user access client pc system time month oper
      RQ2               product consist color scheme font
      RQ3               system intuit explanatori
      RQ4               user interfac standard menus button navig
      RQ5               access system system time
      RQ6               product hour day day year
      RQ7               feel system conform user interfac standard smart devic
      RQ8               system hour 8 am 6 pm
      RQ9               system achiev time
      RQ10              product easi realtor learn
      RQ11              system profession appear
      RQ12              system realtor train
in  1980, is the most widely used. The algorithm is imported from NuGet Package
Manager for .NET Framework (www.nuget.org). Porter’s Stemming Algorithm provides
a set of rules that iteratively reduce English words by replacing them with their stems.
Table  10.5 represents the stemmed data obtained after performing stemming on data
given in Table 10.4.
 Now, the total number of bits required to code any particular class distribution C is
H(C0). It is given by the following formula:
                                                  N=   ∑n ( c )
                                                       c∈C
                                                           n(c)
                                                  p(c) =
	                                                           N
                                      H (C ) = −   ∑ p ( c ) log p ( c )
                                                                       2
c∈C
Now, suppose A is a group of attributes, then the total number of bits needed to code a
class once an attribute has been observed is given by the following formula:
a∈A c∈C
Now, the attribute that obtains the highest information gain is considered to be the highest
ranked attribute, which is denoted by the symbol Ai.
Infogain ( Ai ) = H ( C ) − H (C|Ai )
Table 10.6 shows the list of words that are sorted on the basis of Infogain measure. These
words are the unique words that are obtained after stemming the data. As we can see
from the stemmed data, there are a total of 39 unique words. The Infogain of all these
words was calculated and then they were given the rank, which is shown in Table 10.6.
On the basis of this table, top-5 words, top-25 words, and so on can also be obtained by
using Infogain measure.
  To understand the concept of Infogain measure, the calculation of Infogain correspond-
ing to two words “realtor” and “system” has been shown below. As presented in Table 10.6,
the word “realtor” has been ranked 1 and the word “system” has been given the rank
38. This will become clear from their respective Infogain values. Table 10.7 represents the
matrix of 0’s and 1’s corresponding to the two words, namely, “system” and “realtor.” This
         TABLE 10.6
         List of Words Sorted on the Basis of Infogain Measure
         Rank         Words          Rank         Words         Rank        Words       Rank   Words
                              TABLE 10.7
                              Matrix Representing Occurrence of a Word in
                              a Document
                              Doc#           System          Realtor         NFRType
                              1                  1              0                 1
                              2                  0              0                 2
                              3                  1              0                 3
                              4                  0              0                 2
                              5                  1              0                 1
                              6                  0              0                 1
                              7                  1              0                 2
                              8                  1              0                 1
                              9                  1              0                 1
                              10                 0              1                 3
                              11                 1              0                 2
                              12                 1              1                 3
                                    5       5   4       4   3       3 
                 Entropy ( S ) = −    log 2  −    log 2  −    log 2 
                                    12     12   12     12   12     12 
                                   = 1.553
The Infogain measure of the word “realtor” is as below:
                                                2                10
Infogain ( S, realtor ) = Entropy ( S ) −         Entropy ( 1) −    Entropy ( 0 )
                                               12                12
                 2        2           5         5   4       4   1       1 
= 1.553 − 0.17  −    log 2   − 0.83  −    log 2  −    log 2  −    log 2  
                 2        2           10       10   10     10   10     10  
= 0.424
Mining Unstructured Data                                                                               375
                                             8                4 
   Infogain ( S, System ) = Entropy ( S ) −   Entropy ( 1) −   Entropy ( 0 )
                                             12               12 
                                       8   4       4    2        2    2       2   
                          = 1.553 −       −    log 2   −     log 2   −    log 2   
                                      12   8       8    8        8    8       8   
                                 4    1        1    2       2    1       1   
                            −         −    log 2   −    log 2   −    log 2   
                                12    4        4    4       4    4       4   
                          = 0.053
Thus, we can see, the Infogain value of the two words, namely, “realtor” and “system” is
0.424 and 0.053, respectively, which is calculated by using the above formulae. Similarly,
the Infogain measure of all the words obtained after stemming was calculated, and then
these words were ranked based on their value. The top few words were then used for the
developing the prediction model.
The weighted frequency count is 0 if the term is not present in the document, and a nonzero
value otherwise. There are many ways to represent the normalized or weighted term frequency
in a document. The following formula can be used to compute the normalized frequency count:
376                                                     Empirical Research in Software Engineering
where:
 nj is the total number of documents containing jth term
 n is the total number of documents
This value is a combination of the terms that occur frequently in a particular document
with the terms, which occur rarely among a group of documents.
  TFIDF method is considered to be the most efficient method for weighting the terms.
The  TFIDF value of a term given in a document (Xij) is defined as the product of two
values, which correspond to the term frequency and the inverse document frequency,
respectively. It is given as:
                                                                 n     
                                  TFIDF ( X ij ) = t ij × log 2       
                                                                  nj    
where:
 tij is the frequency of the jth term in document i
 nj is the total number of documents containing jth term
 n is the total number of documents
Term frequency takes the terms that are frequent in the given document to be more impor-
tant than the others. Inverse document frequency takes the terms that are rare across a
group of documents to be more important than the others.
      Example 10.2
      Consider Table  10.8, the number of occurrences of each term in the corresponding
      document is shown. The row represents occurrence of each term in a document and
                     TABLE 10.8
                     Term Frequency Matrix Depicting the Frequency Count
                     of Each Term in the Corresponding Document
                     Document/Term         t1          t2         t3         t4   t5
                     d1                     0          2           8          0   0
                     d2                     5         20           8         15   0
                     d3                    14          0           0          5   0
                     d4                    20          4          13          0   5
                     d5                     0          0           9          7   4
Mining Unstructured Data                                                                      377
     column represent occurrences of a given term in each document. Based on the table, the
     TFIDF value can be calculated. For example, TFIDF value of term t4 in document d2 is
     calculated as below:
= 1.337
                                          5
                         IDF(t4 ) = log 2   = 0.736
                                          3
                                             n    
                         TFIDF = t4 × log 2        = 1.337 × 0.736 = 0.985
                                             n4   
Now, before using the set of N-dimensional vectors, we first need to normalize the values
of the weights. It has been observed that “normalizing” the feature vectors before sub-
mitting them to the learning technique is the most necessary and important condition.
  Table 10.9 shows the TFIDF matrix corresponding to the top-5 words. Now, this matrix
represents the structured form of the original raw data that can now be used for the
development of the prediction model.
            TABLE 10.9
            TFIDF Matrix of Top-5 Words of NFR Example
            Doc    Realtor      Hour        Time       Interfac   Standard     NFR Type
            1      0          2.115477     2.115477     0          0              1
            2      0          0            0            0          0              2
            3      0          0            0            0          0              3
            4      0          0            0            2.70044    2.70044        2
            5      0          0            2.115477     0          0              1
            6      0          2.115477     0            0          0              1
            7      0          0            0            2.70044    2.70044        2
            8      0          2.115477     0            0          0              1
            9      0          0            2.115477     0          0              1
            10     2.70044    0            0            0          0              3
            11     0          0            0            0          0              2
            12     2.70044    0            0            0          0              3
378                                                Empirical Research in Software Engineering
10.3.1 Mining the Fault Reports to Predict the Severity of the Faults
One of the most popular applications of text mining is to mine the fault descriptions
available in software repositories and extract relevant information from the description,
which is in the form of some relevant words extracted by employing text mining tech-
niques. Thus, the data is reduced to a structured format, which can now be applied for the
development of prediction models. With the help of this structured data, the severities of
the document could be predicted, which is one of the most important aspect of the fault
reports. The prediction of fault severity is very important as the faults with higher severity
could be dealt first on a priority basis, thus leading to an efficient utilization of the avail-
able resources and manpower. Menzies and Marcus (2008) mined fault description using
text mining and machine learning techniques using rule-based learning.
could be taken into account and development of an efficient software product meeting the
stakeholder’s real needs could be achieved.
  Apart from these, there are various other applications of text mining that are restricted
not only to the area of software engineering, but also to other fields of the literature like
medicine, networking, chemicals, and so on.
10.4.1 Introduction
As the complexity and size of the software is increasing, the introduction of faults into the
software has become an implicit part of the development, which cannot be avoided whatso-
ever the circumstances may be. This causes the faults to enter the software, thereby leading
to functional failures. There are a number of bug tracking systems such as Bugzilla and CVS
that are widely used to track the faults present in various open source software repositories.
The faults, which are introduced in the software, are of varying severity levels. As a result,
these bug tracking systems contain the fault reports that include detailed information about
the faults along with their IDs and associated severity level. A severity level is used by
many organizations to measure the effect of fault on the software. This impact may range
from mild to catastrophic, wherein catastrophic faults are most severe faults that may lead
the entire system to go to a crash state. The faults that have a severe impact on the func-
tioning of the software and may adversely affect the software development are required
to be handled on priority basis. Faults having high-severity level must be dealt with on a
priority basis as their presence may lead to a major loss like human life loss, crash of an
airplane, and so on. However, the data present in such systems is generally in unstructured
form. Hence, text mining techniques in combination with machine learning techniques are
required to analyze the data present in the defect tracking system.
   In this study the information from the NASA’s database called project and issue tracking
system (PITS) is mined, by developing a tool that will first extract the relevant informa-
tion from PITS-A using text mining techniques. After extraction, the tool will then predict
the fault severities using machine learning techniques. The faults are classified into five
categories of severity by NASA’s engineers as very high, high, medium, low, and very low.
   In this study multilayer perceptron technique is used to predict the faults at various
levels of severity. The prediction of fault severity will help the researchers and software
practitioners to allocate their testing resources on more severe areas of the software. The
performance of the predicted model will be analyzed using area under the ROC curve
(AUC) obtained from ROC analysis.
380                                              Empirical Research in Software Engineering
  Testing is an expensive activity that consumes maximum resources and cost in the
software development life cycle. This study is particularly useful for the testing profes-
sionals for quickly predicting severe faults under time and resource constraints. For exam-
ple, if only 25% of testing resources are available, the knowledge of the severe faults will
help the testers to focus the available resources on fixing the severe faults. The faults with
higher severity level should be tested and fixed before the faults with lower severity level
are tested and fixed (Menzies and Marcus 2008). The testing professionals can select from
the list of prioritized faults according to the available testing resources using models pre-
dicted at various severity levels of faults. The models developed in this study will also
guide the testing professional in deciding when to stop testing—when an acceptable num-
ber (perhaps decided using past experiences) of faults have been corrected and fixed, then
the testing team may decide to stop testing and allow the release of the software.
of Infogain measure. In this study, top-5, 25, 50, and 100 words were considered and the
performance of the model was evaluated with respect to these words corresponding to
each of the severity level, namely, high, medium, low, and very low. Table 10.10 demon-
strates the top-100 words obtained after the ranking done by Infogain measure. From this
table, top-5, 25, and 50 words can also be obtained.
   MLP is one of the most popular algorithm that is used for supervised classification. It is
responsible for mapping a set of input values onto a set of appropriate output values.
MLP technique is based on the concept of back propagation. Back propagation is a type
of learning procedure that is used to train the network. It comprises of two passes—for-
ward pass and backward pass. In the forward pass, the inputs are fed to the network and
then the effect is propagated layer by layer by keeping all the weights of the network
fixed. In the backward pass, the updation of weights takes place according to the error
computed. The process is repeated over and over again until the desired performance is
achieved.
   To evaluate the performance of the predicted model, there were different performance
measures that were used. These were sensitivity, AUC, and the cutoff point. All these
measures determine how well the model has predicted what it was intended to predict.
The explanation for these measures has been provided in Chapter 7. The validation method
     TABLE 10.10
     Top-100 Terms in PITS-A Data Set, Sorted by Infogain
     Rank       Words       Rank     Words       Rank       Words       Rank    Words
used in the study is holdout validation (70:30 ratio) in which the entire data set is divided
into 70% training data and the remaining 30% as test data. A partitioning variable is used
that splits the given data set into training and testing samples in 70:30 ratio. This variable
can have the value either 1 or 0. All the cases with the value of 1 for the variable are assigned
to the training samples, and all the other cases are assigned to the testing samples. To get more
generalized and accurate results, validation has been performed using 10 separate partitioning
variables. Each time, MLP method is used for training, and the testing samples are used to
validate the results.
                   180,000
                                 156,499
                   160,000
                                                     150,921
                   140,000
                   120,000
                   100,000                                           89,119
                                                                                  89,119
                    80,000
                    60,000
                    40,000
                    20,000
                         0
                             Original words    Words after       Words after   Words after
                                               tokenization      stop words    stemming
                                                                   removal
FIGURE 10.3
Results of applying preprocessing to the PITS-A data set.
Mining Unstructured Data                                                                383
              TABLE 10.11
              Results for Top-5 Words Corresponding to High and Medium
              Severity Faults
                            High Severity Defects       Medium Severity Defects
              Runs      AUC     Sensitivity   Cutoff   AUC     Sensitivity   Cutoff
              1         0.873      0.778      0.1818   0.785     0.545       0.4159
              2         0.824      0.689      0.1632   0.786     0.519       0.4363
              3         0.853      0.781      0.1655   0.759     0.553       0.432
              4         0.852      0.733      0.2107   0.785     0.581       0.3892
              5         0.862      0.78       0.1698   0.798     0.848       0.4652
              6         0.872      0.772      0.1095   0.727     0.52        0.5539
              7         0.83       0.2        0.1561   0.778     0.543       0.4477
              8         0.84       0.753      0.1784   0.777     0.557       0.4361
              9         0.897      0.833      0.1775   0.829     0.81        0.4311
              10        0.868      0.765      0.1811   0.782     0.514       0.438
              Average   0.857      0.708         –     0.781     0.599          –
              TABLE 10.12
              Results for Top-5 Words Corresponding to Low and Very Low
              Severity Faults
                            Low Severity Defects       Very Low Severity Defects
exceptionally well in predicting high-severity faults than in predicting medium, low, and
very low severity faults when top-5 words were considered for classification.
   On such similar lines, conclusion can be drawn regarding the performance of MLP
when taking into account top-25 words. It can be seen from Tables 10.13 and 10.14 that MLP
has predicted high-severity faults with much correctness, as the maximum value of AUC
is 0.903 with approximately 83% value of sensitivity. The performance of MLP is also good
in predicting faults at other severity levels, namely, medium, low, and very low. This is
because the average values of AUC at these severity levels are 0.80, 0.77, and 0.76 approxi-
mately. Thus, it can be said that MLP model is recommended for predicting the faults as
the number of words considered for classification increases.
384                                                    Empirical Research in Software Engineering
              TABLE 10.13
              Results for Top-25 Words Corresponding to High and Medium
              Severity Faults
                            High Severity Defects          Medium Severity Defects
              TABLE 10.14
              Results for Top-25 Words Corresponding to Low and Very Low
              Severity Faults
                            Low Severity Defects          Very Low Severity Defects
              Runs      AUC     Sensitivity   Cutoff      AUC     Sensitivity   Cutoff
              1         0.813      0.741      0.2058      0.697      0.667      0.0216
              2         0.754      0.627      0.2063      0.841      0.75       0.0262
              3         0.794      0.690      0.2218      0.811      0.714      0.0343
              4         0.753      0.636      0.3234      0.717      0.600      0.0244
              5         0.800      0.735      0.3027      0.696      0.667      0.0343
              6         0.774      0.672      0.2168      0.81       0.800      0.0384
              7         0.706      0.652      0.2834      0.804      0.778      0.0367
              8         0.71       0.627      0.297       0.515      0.500      0.0213
              9         0.745      0.671      0.2568      0.855      0.800      0.0373
              10        0.845      0.742      0.3051      0.807      0.625      0.019
              Average   0.769      0.679         –        0.755      0.690         –
  From Tables 10.15 and 10.16, it can be seen that the performance of MLP with respect to
to top-50 words is exceptionally good for all types of faults with the average value of AUC
being 0.91, 0.82, 0.80, and 0.81 at high, medium, low, and very low severities, respectively.
So, from the discussion, it can be concluded that MLP method has worked very well for
predicting the faults when taking into account top-50 words for classification.
  Even when top-100 words are considered for classification (Tables 10.17 and 10.18), it is
seen that the performance of MLP is exceptionally good in predicting high-severity faults
as the maximum value of AUC is 0.928 with 85.3% sensitivity value. The performance of
MLP model is nominal for other severity faults.
Mining Unstructured Data                                                                 385
              TABLE 10.15
              Results for Top-50 Words Corresponding to High and Medium
              Severity Faults
                            High Severity Defects       Medium Severity Defects
              Runs      AUC     Sensitivity   Cutoff   AUC     Sensitivity   Cutoff
              1         0.916      0.865      0.2292   0.812      0.733      0.4776
              2         0.905      0.817      0.1824   0.83       0.742      0.402
              3         0.919      0.841      0.2642   0.831      0.739      0.4214
              4         0.915      0.84       0.2209   0.83       0.736      0.5072
              5         0.937      0.885      0.3282   0.829      0.75       0.4168
              6         0.943      0.876      0.3365   0.846      0.757      0.4338
              7         0.892      0.81       0.1858   0.824      0.748      0.4764
              8         0.906      0.826      0.1963   0.824      0.748      0.4531
              9         0.906      0.828      0.2702   0.792      0.71       0.43
              10        0.875      0.793      0.3138   0.771      0.704      0.3972
              Average   0.911      0.838         –     0.819      0.736         –
              TABLE 10.16
              Results for Top-50 Words Corresponding to Low and Very Low
              Severity Faults
                            Low Severity Defects       Very Low Severity Defects
              Runs      AUC     Sensitivity   Cutoff   AUC     Sensitivity   Cutoff
              1         0.801      0.692      0.224    0.754      0.667      0.026
              2         0.848      0.743      0.273    0.848      0.6        0.020
              3         0.808      0.704      0.246    0.816      0.75       0.032
              4         0.773      0.703      0.185    0.896      0.75       0.028
              5         0.806      0.712      0.191    0.848      0.778      0.027
              6         0.812      0.721      0.226    0.673      0.6        0.017
              7         0.795      0.681      0.230    0.799      0.714      0.040
              8         0.792      0.708      0.256    0.846      0.833      0.056
              9         0.823      0.768      0.240    0.801      0.6        0.022
              10        0.755      0.653      0.260    0.820      0.778      0.031
              Average   0.801      0.708        –      0.810      0.707        –
              TABLE 10.17
              Results for Top-100 Words Corresponding to High and Medium
              Severity Faults
                            High Severity Defects          Medium Severity Defects
              Runs      AUC     Sensitivity   Cutoff      AUC     Sensitivity   Cutoff
              1         0.923      0.863      0.2566      0.782      0.689      0.3952
              2         0.912      0.82       0.3002      0.754      0.687      0.3787
              3         0.918      0.833      0.2244      0.78       0.732      0.3807
              4         0.928      0.853      0.2777      0.802      0.725      0.4713
              5         0.899      0.816      0.2476      0.792      0.694      0.4344
              6         0.904      0.816      0.2944      0.827      0.735      0.4387
              7         0.918      0.848      0.2967      0.83       0.752      0.3464
              8         0.889      0.809      0.2033      0.811      0.73       0.528
              9         0.917      0.821      0.2636      0.803      0.708      0.4026
              10        0.901      0.819      0.1994      0.807      0.742      0.4584
              Average   0.910      0.829         –        0.798      0.719         –
              TABLE 10.18
              Results for Top-100 Words Corresponding to Low and Very Low
              Severity Faults
                            Low Severity Defects          Very Low Severity Defects
              Runs      AUC     Sensitivity   Cutoff      AUC     Sensitivity   Cutoff
              1         0.756      0.693      0.306       0.758      0.6        0.0231
              2         0.785      0.711      0.3199      0.837      0.778      0.0215
              3         0.826      0.741      0.2844      0.881      0.75       0.0421
              4         0.823      0.753      0.2327      0.807      0.75       0.0197
              5         0.766      0.688      0.2922      0.85       0.8        0.0502
              6         0.819      0.75       0.233       0.687      0.5        0.0209
              7         0.835      0.743      0.2711      0.642      0.6        0.0212
              8         0.809      0.719      0.2061      0.663      0.571      0.0142
              9         0.819      0.69       0.2195      0.659      0.571      0.021
              10        0.805      0.729      0.2982      0.669      0.571      0.0261
              Average   0.804      0.722         –        0.745      0.649         –
The results show that, in most cases, there is only a marginal difference between the AUC of
models developed using the top-50 words and the AUC of models developed using only the
top-5 words. Hence, it is notable that with very few words (only 5) the models perform nearly
as well as the models developed using a much larger number of words. For example, the
maximum AUC of the model developed on the PITS-A data set using the top-50 words at the
high-severity level is 0.943, whereas after reducing the number of words by 90% (i.e., from
50 words to 5 words), the maximum AUC obtained for the top-5 words at the high-severity
level is 0.897.
10.4.7 Conclusion
In today's scenario, there is a growing need for defect prediction models that are capable
of detecting the faults introduced in the software. Equally important is the severity level of
each introduced fault, which may range from mild to catastrophic. Catastrophic faults are
the most severe faults and must be identified and dealt with as early as possible to prevent
any further damage to the software. With this intent, fault prediction models were developed
using MLP as the classification method. The data set employed for validation is the PITS
data set, which is popularly used by NASA's engineers. The data set was mined using text
mining steps, and the relevant information was extracted in terms of the top few words
(top-5, 25, 50, and 100 words). These words were used to develop the model, and the
developed model was then used to assign a severity level to each of the faults found during
testing.
   It was observed from the results that the MLP model performed exceptionally well in
predicting faults at the high-severity level irrespective of the number of words considered
for classification. This is evident from the AUC values lying in the range of 0.824–0.943.
The performance of the model is also good for predicting medium-severity faults, with the
maximum AUC value being as high as 0.846. On the other hand, for faults at the low and
very low severity levels, the performance of the model is only nominal. Thus, it can be
concluded that the performance of the model is best when the top-50 words are taken into
account. Also, with very few words (only 5), the model performed nearly as well as the
models developed, in most cases, using a much larger number of words. From this analysis,
it can be said that the model is suitable for predicting the severity levels of faults even with
very few words. This would be highly beneficial to an organization in terms of proper
allocation of testing resources and the available manpower.
Exercises
  10.1 What is text mining? What are the applications of text mining in software
     engineering?
  10.2 Explain the steps in text mining. Why is mining relevant attributes important
     before applying data analysis techniques?
                                            Number of Documents
                                  Term       Containing the Term
                                  Train                200
                                  User                 100
                                  Learn                  4
                                  Display               50
                                  System                20
  10.10 Consider the document/term matrix given below and calculate the information
     gain (InfoGain) for each term.
                              Document/Term    T1    T2    T3    T4
                              D1                 1     0     1     0
                              D2                 0     0     0     1
                              D3                 1     1     0     1
                              D4                 0     0     1     1
                              D5                 1     1     0     0
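A minimal Python sketch of the computation is given below for illustration only. Information gain is always measured with respect to a class attribute, so hypothetical class labels are assumed for D1–D5; these labels are not part of the exercise data.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Document/term matrix from Exercise 10.10 (rows D1-D5, columns T1-T4).
docs = {"D1": [1, 0, 1, 0], "D2": [0, 0, 0, 1], "D3": [1, 1, 0, 1],
        "D4": [0, 0, 1, 1], "D5": [1, 1, 0, 0]}
# Hypothetical class labels (an assumption, not given in the exercise).
labels = {"D1": "defect", "D2": "clean", "D3": "defect", "D4": "clean", "D5": "defect"}

base_entropy = entropy(list(labels.values()))
for j, term in enumerate(["T1", "T2", "T3", "T4"]):
    present = [labels[d] for d, row in docs.items() if row[j] == 1]
    absent = [labels[d] for d, row in docs.items() if row[j] == 0]
    n = len(docs)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    print(term, "InfoGain =", round(base_entropy - conditional, 3))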
Further Readings
The following books provide techniques and procedures for information retrieval:
In the following paper, a novel approach that applies frequent item sets for text clustering
is presented:
The objective of this chapter is to demonstrate and present the practical application of
the empirical concepts and procedures presented in previous chapters. This chapter also
follows the report structure given in Chapter 9. The work presented in this chapter is
based on change prediction.
   The three important criteria for comparing results across various studies are (1) the
study size, (2) the way in which the performance of the developed models is measured,
and (3) the statistical tests used. Also, the availability of data sets has always been a constraint
in software engineering research. The use of stable performance measures is another factor
to be considered. Statistical tests for assessing the actual significance of results are rarely
used in change prediction studies. Moreover, models should be validated on data different
from the data from which they are derived to increase the confidence in the conclusions of
the study. To address these issues, in this chapter, we compare one statistical technique and
17 machine learning (ML) techniques for investigating the effect of object-oriented (OO)
metrics on change-prone classes. The hypothesis tested is that there is a statistically
significant difference between the performance of the compared techniques.
11.1 Abstract
Software maintenance is a predominant and crucial phase in the life cycle of a software
product, as evolution of the software is important to keep it functional and profitable.
Planning of the maintenance activities and distribution of resources is a significant step
towards developing software within the specified budget and time. Change prediction
models help in the identification of classes/modules that are prone to change in the future
releases of a software product. These classes represent the weak parts of the software. Thus,
change prediction models help the software industry in proper planning of maintenance
activities, as change-prone classes should be allocated greater attention and resources for
developing good-quality software.
11.1.1 Basic
Change proneness is an important software quality attribute as it signifies the probability
that a specific class of a software product would change in its forthcoming release.
A number of techniques are available in the literature for the development of change
prediction models. This study aims to compare and assess one statistical and 17 ML
techniques for effective development of change prediction models. The issues addressed are
(1) comparison of the ML techniques and the statistical technique over popular data sets,
(2) use of various performance measures for evaluating the performance of change
prediction models,
(3) use of statistical tests for comparing and assessing the performance of the ML techniques,
and (4) validation of models on data sets different from those on which they are trained.
11.1.2 Method
To perform comparative analysis of one statistical and 17 ML techniques, the study devel-
oped change prediction models on six open source data sets. The data sets are application
packages of the widely used Android OS. The developed models are statistically assessed
using statistical tests on a number of performance measures.
11.1.3 Results
The results of the study indicate that the logistic regression (LR), multilayer perceptron
(MLP), and bagging (BG) techniques are good choices for developing change prediction models.
The results of the study can be effectively used by software practitioners and researchers
in choosing an appropriate technique for developing change prediction models.
11.2 Introduction
11.2.1 Motivation
Recently, there has been a surge in the number of studies that develop models for pre-
dicting various software quality attributes such as fault proneness, change proneness,
and maintainability. These studies help researchers and software practitioners in effi-
cient resource usage and developing cost-effective, highly maintainable good quality
software products. Change proneness is a critical software quality attribute that can
be assessed by developing change prediction models. Identification of change-prone
classes is crucial for software developers as it helps in better planning of constrained
project resources like time, cost, and effort. It would also help developers in taking
preventive measures, such as better designs and restructuring of these classes, in the earlier
phases of the software development life cycle so that minimum defects and changes are
introduced in such classes.
stringent verification processes like inspections and reviews. This would help in early
detection of errors in the classes so that developers can take timely corrective actions.
Although a number of techniques have been exploited and assessed to develop mod-
els for ascertaining change-proneness attribute, the search for the best technique still
continues. The academia as well as the industry is tirelessly exploring the capabilities
of different techniques to evaluate their effectiveness in developing efficient prediction
models. Thus, there is an urgent need for comparative assessment of various techniques
that can help the industry and researchers in choosing a practical and useful technique
for model development.
11.2.2 Objectives
To develop software quality prediction models, various software metrics are used as the
independent variables and a particular software quality attribute as the dependent vari-
able. The model is basically a set of classification rules that can predict the dependent
variable on the basis of the independent variables. The classification model can be created
by a number of techniques such as the statistical technique (LR) or ML techniques.
The  capability of a technique can be assessed by evaluating the model developed by
the particular technique. The various performance evaluators that are used in the study
are classification accuracy, precision, specificity, sensitivity, F-measure, and area under
the receiver operating characteristic (ROC) curve (AUC). According to Afzal and Torkar
(2008), the use of a number of performance measures strengthens the conclusions of the
study. This empirical study ascertains the comparative performance of statistical and ML
techniques for the prediction of change-proneness attribute of a class in an OO software.
Moreover, this study assesses models developed using tenfold cross-validation, which is
widely accepted in research (Pai and Bechta Dugan 2007; De Carvalho et al. 2010).
Use of tenfold cross-validation reduces validation bias and increases the conclusion
validity of the study. The study also strengthens its conclusions by statistically compar-
ing the models developed by various techniques using Friedman and Nemenyi post hoc
test. Furthermore, this study analyzes the application packages of a widely used mobile
operating system named Android, which is open source in nature. Selection of such sub-
jects for developing model increases the generalizability of the study and increases the
applicability of the study’s results.
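As an illustration only (not part of the original study), the sketch below shows how these performance evaluators can be computed in Python with scikit-learn for a binary change-proneness prediction; the labels and predicted probabilities are hypothetical.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical actual labels (1 = change prone) and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_prob = np.array([0.80, 0.30, 0.65, 0.40, 0.20, 0.55, 0.10, 0.70, 0.35, 0.25])
y_pred = (y_prob >= 0.5).astype(int)  # a cut-off point of 0.5 is assumed

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("Specificity:", tn / (tn + fp))                   # TN / (TN + FP)
print("Precision  :", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("F-measure  :", f1_score(y_true, y_pred))
print("AUC        :", roc_auc_score(y_true, y_prob))    # area under the ROC curve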
11.2.3 Method
This study empirically validates six open source data sets to evaluate the performance of
18 different techniques for change-proneness prediction. The comparative assessment of
various techniques is evaluated with the help of Friedman statistical test and Nemenyi
post hoc test. The Friedman test assigns a mean rank to all the techniques on the basis
of a specific performance measure and tests whether the predictive performance of all
the techniques is equivalent. In case the predictive performance of various techniques is
found to be statistically significantly different, there is a need to perform Nemenyi post
hoc test. The Nemenyi test compares the results of each pair of techniques to ascertain
the better performing technique of the two. It computes a critical distance and checks
whether the performance of a pair of techniques is significantly different or not. The study
evaluates the
change prediction models using six performance measures. The data sets used in the
study are collected from the GIT repository using the defect collection and reporting
system (DCRS) tool.
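The sketch below illustrates how such a comparison can be carried out with the Friedman test in Python using SciPy; the AUC values are hypothetical placeholders (one value per data set for each technique), whereas the study itself compares 18 techniques over six data sets.

from scipy.stats import friedmanchisquare

# Hypothetical AUC values of three techniques on six data sets (placeholders only).
auc_lr  = [0.73, 0.81, 0.60, 0.69, 0.80, 0.73]
auc_mlp = [0.71, 0.81, 0.58, 0.66, 0.79, 0.71]
auc_vp  = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48]

statistic, p_value = friedmanchisquare(auc_lr, auc_mlp, auc_vp)
print("Friedman statistic:", round(statistic, 3), "p-value:", round(p_value, 4))
# If p_value < 0.05, the techniques differ significantly and the Nemenyi post hoc
# test is applied to locate the pairs of techniques that differ.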
Dejager et al. 2013; Malhotra and Khanna 2013). This study evaluates the capabilities of
17 ML techniques for developing change prediction models and compares their predic-
tive capability with LR. The ML techniques explored in this case study include adaptive
boosting (AB), alternating decision tree (ADT), BG, Bayesian network (BN), decision
table (DTab), J48 decision tree, repeated incremental pruning to produce error reduction
(RIPPER), LogitBoost (LB), MLP, naïve Bayes (NB), non-nested generalized exemplars
(NNge), random forests (RF), radial basis function (RBF) network, REP tree (REP), support
vector machine using sequential minimal optimization (SVM-SMO), voted perceptron (VP), and
ZeroR techniques implemented in the WEKA tool.
of cohesion among methods (LCOM). The study also analyzed the quality model for
object-oriented design metric suite that consists of the data access metric (DAM), measure
of aggregation (MOA), measure of functional abstraction (MFA), cohesion among methods
of a class (CAM), and number of public methods (NPM). The afferent coupling (Ca) and
efferent coupling (Ce) metrics proposed by Martin (2002) were also included. Some other
miscellaneous metrics included in the study were the inheritance coupling (IC) metric,
coupling between methods of a class (CBM), average method complexity (AMC), lines of
code (LOC), and LCOM3 (Henderson–Sellers version of LCOM). The detailed definition of
each metric can be found in Chapter 3. These metrics are the independent variables of the
study and measure various OO properties of software like cohesion, size, coupling,
reusability, and so on.
  The objective of the study is to ascertain change-prone classes. Thus, change proneness,
that is, the likelihood of change in a class after the software goes into the operation phase,
is the dependent variable of the study. To quantify change in a class, the LOC inserted in or
deleted from the class are taken into account, as illustrated in the sketch below.
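For illustration, a small Python sketch is given below; it assumes that the LOC inserted and deleted per class between two releases are already available (the column names are illustrative, not the actual output format of the DCRS tool).

import pandas as pd

changes = pd.DataFrame({
    "class_name":   ["A.java", "B.java", "C.java"],
    "loc_inserted": [12, 0, 3],
    "loc_deleted":  [4, 0, 0],
})
# A class is labelled change prone if any LOC were inserted in or deleted from it.
changes["change_prone"] = ((changes["loc_inserted"] + changes["loc_deleted"]) > 0).astype(int)
print(changes)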
      RBF, REP, SVM-SMO, VP, and ZeroR) do not show significant differences when
      evaluated using specificity measure.
    • Ha alternate hypothesis: Change prediction models developed using all the tech-
      niques (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF,
      RBF, REP, SVM-SMO, VP, and ZeroR) show significant differences when evalu-
      ated using specificity measure.
  • Hypothesis for precision measure
    • H0 null hypothesis: Change prediction models developed using all the techniques
      (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF, RBF, REP,
      SVM-SMO, VP, and ZeroR) do not show significant differences when evaluated
      using precision measure.
    • Ha alternate hypothesis: Change prediction models developed using all the tech-
      niques (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF,
      RBF, REP, SVM-SMO, VP, and ZeroR) show significant differences when evalu-
      ated using precision measure.
  • Hypothesis for F-measure
    • H0 null hypothesis: Change prediction models developed using all the techniques
      (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF, RBF, REP,
      SVM-SMO, VP, and ZeroR) do not show significant differences when evaluated
      using F-measure.
    • Ha alternate hypothesis: Change prediction models developed using all the tech-
      niques (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF,
      RBF, REP, SVM-SMO, VP, and ZeroR) show significant differences when evalu-
      ated using F-measure.
  • Hypothesis for AUC measure
    • H0 null hypothesis: Change prediction models developed using all the techniques
      (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF, RBF, REP,
      SVM-SMO, VP, and ZeroR) do not show significant differences when evaluated
      using AUC performance measure.
    • Ha alternate hypothesis: Change prediction models developed using all the tech-
      niques (LR, AB, ADT, BG, BN, DTab, J48, RIPPER, LB, MLP, NB, NNge, RF,
      RBF, REP, SVM-SMO, VP, and ZeroR) show significant differences when evalu-
      ated using AUC performance measure.
     TABLE 11.1
     Software Data Set Details
     Software Name       Versions Analyzed     No. of Data Points   % of Changed Classes
Telephony
MMS
Gallery2
Calendar
Contacts
Bluetooth
FIGURE 11.1
Change statistics of all data sets.
  • REP Tree: It is a fast decision tree technique that reduces variance (Mohamed et al.
    2012).
  • SVM: This technique effectively handles high dimensionality, redundant features,
    and complex functions. It is robust in nature (Malhotra 2014).
  • VP: This technique is comparable in terms of accuracy to SVM. However, lit-
    erature claims that the technique has better learning and prediction speed as
    compared to the traditional SVM technique (Freund and Schapire 1999; Sassano
    2008).
Some other techniques such as DTab, NNge, RBF, and ZeroR were also selected for analyz-
ing their performances.
  • The descriptive statistics of all the data sets are collected and analyzed.
  • Next, identify all the outliers in a particular data set and remove them. The
    change prediction models are then developed with the remaining data points
    using 18 techniques.
  • Next, to reduce the dimensionality of the input data set, use correlation-based
    feature selection (CFS) method and identify the important metrics for each corre-
    sponding data set. This step eliminates the noisy and redundant features of each
    data set.
  • Now develop models using all the 18 techniques on the six selected data sets using
    tenfold cross-validation method. The change prediction models developed by all
    the techniques are evaluated using six performance measures.
  • Analyze the developed models using Friedman statistical test and evaluate the
    developed hypothesis.
  • Finally, perform the Nemenyi post hoc test to find the pairs of techniques that are
    statistically significantly different from each other (a condensed sketch of these
    steps is given below).
features and eliminates them before model development. This helps in improving the
results of the model.
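A condensed sketch of this sequence of steps is given below. It assumes the data set is available as a pandas DataFrame of OO metrics with a binary change_prone column; because CFS itself is a WEKA method, a simple greedy correlation-based filter is used here only as a rough stand-in, and the generated data are purely synthetic.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def remove_outliers(df, metric_cols, z_max=3.0):
    """Drop data points whose z-score exceeds z_max on any metric (one common rule)."""
    z = (df[metric_cols] - df[metric_cols].mean()) / df[metric_cols].std(ddof=0)
    return df[(z.abs() <= z_max).all(axis=1)]

def correlation_filter(df, metric_cols, target_col, keep=5, redundancy=0.8):
    """Greedy stand-in for CFS: keep metrics most correlated with the target,
    skipping metrics that are highly correlated with an already selected one."""
    ranked = df[metric_cols].corrwith(df[target_col]).abs().sort_values(ascending=False)
    selected = []
    for metric in ranked.index:
        if len(selected) == keep:
            break
        if all(abs(df[metric].corr(df[s])) < redundancy for s in selected):
            selected.append(metric)
    return selected

# Hypothetical data set: replace with the actual Android package data.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=["wmc", "cbo", "lcom", "loc"])
data["change_prone"] = (data["loc"] + rng.normal(scale=0.5, size=100) > 0).astype(int)

metrics = ["wmc", "cbo", "lcom", "loc"]
data = remove_outliers(data, metrics)
chosen = correlation_filter(data, metrics, "change_prone", keep=3)

# Tenfold cross-validation of one of the 18 techniques (LR shown here).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         data[chosen], data["change_prone"],
                         cv=cv, scoring="roc_auc")
print("Selected metrics:", chosen, "mean AUC:", round(scores.mean(), 3))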
      grow, prune, and optimize. The parameter settings for the technique in the WEKA
      tool were three folds, two optimizations, seed as one, and use of pruning as true.
  •   LB: It is a boosting technique that uses additive LR (Friedman et  al. 2000). The
      technique uses a likelihood threshold of −1.79, weight threshold of 100, 10 itera-
      tions, a shrinkage parameter of 1, and reweighing as the parameter settings in the
      WEKA tool.
  •   MLP: It is based on the functioning of the nervous system and is capable of modeling
      complex relationships with ease. Apart from an input and an output layer, it has
      a number of intermediate hidden layers. Synaptic weights are assigned and
      adjusted in these intermediate layers with the back propagation training algorithm.
      The technique comprises two passes: forward as well as backward. The backward
      pass produces an error signal that helps in reducing the difference between the
      actual and desired output (Haykin 1998). WEKA uses a learning rate of 0.005 and
      a sigmoid transfer function. The number of hidden layers was set as 1.
  •   NB: This ML algorithm is based on the Bayes theorem and creates a probabilistic
      model for prediction. All the features are treated independently by the technique,
      and it requires only a small training set for developing classification models. The
      default settings of WEKA use the kernel estimator and supervised discretization
      as false for NB.
  •   NNge: This technique uses non-nested generalized exemplars, which are
      hyperrectangles that can be viewed as if–then rules. The parameter settings were
      five attempts for generalization and five folders for mutual information in the
      WEKA tool.
  •   RF: It is composed of a number of tree predictors. The tree predictors are based
      on a random vector, which is sampled independently with the same distribution.
      The output class of RF is the mode of the outputs of all the individual trees in the
      forest (Breiman 2001). RF is advantageous because of its noise robustness, parallel
      nature, and fast learning. RF was used with 10 trees as the parameter setting in
      the WEKA tool.
  •   RBF network: This technique is an implementation of a normalized Gaussian RBF
      network. To derive the centers and widths of the hidden layer, the algorithm uses
      k-means clustering. The LR technique is used to combine the outputs from the
      hidden units. The technique uses parameter settings of two clusters and one
      clustering seed in the WEKA tool.
  •   REP: It is a fast decision tree learning technique that uses information gain or
      variance reduction to build a decision tree. The technique performs reduced
      error pruning with backfitting. The default parameter settings for this technique
      in the WEKA tool were a seed of 1, a maximum depth of −1, a minimum variance
      of 0.001, and 3 folds.
  •   Support vector machine (SVM): It aims to construct an optimal hyperplane that
      can efficiently separate new instances into two separate categories (Cortes and
      Vapnik 1995). WEKA uses the sequential minimal optimization algorithm to train
      the SVM. The parameter settings used by the technique in the WEKA tool were a
      random seed of 1, a tolerance parameter of 0.01, a c value of 1.0, an epsilon value
      of 1.0E-12, and a polykernel.
  •   VP: It is a technique that is based on Frank Rosenblatt's perceptron (Frank 1958).
      The technique can be effectively used in high-dimensional spaces with the use of
      kernel functions. The parameter settings for the technique in the WEKA tool were
      an exponent value of 1 and a seed of 1.
  • ZeroR: It is a technique that uses 0-R classifier. The technique predicts the mean if
    the class is numeric or the mode if the class is nominal.
The change prediction models were developed using the tenfold cross-validation method.
This method involves division of the data points into 10 partitions, where nine partitions
are used for training the model while the tenth partition is used for model validation.
A total of 10 iterations are performed, each time using a different partition as the validation
set (Stone 1974).
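A brief sketch of this procedure in Python is shown below, using a stratified variant so that each partition preserves the class distribution; the 50 data points and labels are hypothetical.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(50).reshape(-1, 1)      # 50 hypothetical data points
y = np.array([0, 1] * 25)             # hypothetical binary change-proneness labels

splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(splitter.split(X, y), start=1):
    # Nine partitions (45 points) train the model; the remaining partition validates it.
    print(f"Fold {fold}: train = {len(train_idx)} points, validate = {len(test_idx)} points")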
TABLE 11.2
Descriptive Statistics for Android Bluetooth Data Set
Metric Name       Min.      Max.       Mean             SD      Percentile (25%)   Percentile (75%)
TABLE 11.3
Descriptive Statistics for Android Contacts Data Set
Metric Name      Min.        Max.        Mean           SD      Percentile (25%)   Percentile (75%)
TABLE 11.4
Descriptive Statistics for Android Calendar Data Set
Metric Name       Min.      Max.        Mean           SD       Percentile (25%)   Percentile (75%)
TABLE 11.5
Descriptive Statistics for Android Gallery2 Data Set
Metric Name      Min.       Max.       Mean            SD       Percentile (25%)   Percentile (75%)
TABLE 11.6
Descriptive Statistics for Android MMS Data Set
Metric Name       Min.       Max.       Mean        SD         Percentile (25%)   Percentile (75%)
TABLE 11.7
Descriptive Statistics for Android Telephony Data Set
Metric Name      Min.       Max.       Mean         SD         Percentile (25%)   Percentile (75%)
                        TABLE 11.8
                        Outlier Details
                        Data Set Name                 Number of Outliers
                        Bluetooth                              7
                        Contacts                              12
                        Calendar                               6
                        Gallery2                              43
                        MMS                                   22
                        Telephony                             37
                 TABLE 11.9
                 Metrics Selected by CFS Method
                 Data Set Name                       Metrics Selected
  TABLE 11.10
  Validation Results Using Accuracy Performance Measure
  Technique     Bluetooth        Contacts     Calendar      Gallery2    MMS      Telephony
     techniques were achieved over the Android Contacts data set. However, the tech-
     nique that developed the model exhibiting best accuracy values differed in all
     the data sets. The change prediction model developed using the SVM technique
     achieved the best accuracy value of 83.07% on Android Bluetooth data set and
     83.00% on Android Calendar data set. The MLP technique gave the best accuracy
     value (75.25%) on Android Contacts data set and the LB technique (accuracy value:
     75.72%) on the Android MMS data set. The best accuracy measures on Gallery2 and
     Telephony data sets were given by NNge and LR technique with an accuracy value
     of 64.04% and 68.86%, respectively.
  2. Validation results using sensitivity measure
     Table 11.11 is a representation of sensitivity values of change prediction models
     developed by all the techniques of the study on six Android data sets. We can
  TABLE 11.11
  Validation Results Using Sensitivity Performance Measure
  Technique      Bluetooth     Contacts     Calendar         Gallery2   MMS      Telephony
  AB               63.63         75.28        41.17           61.71     76.59a     63.77
  ADT              72.72         74.15a       47.05           60.15     70.21      64.56
  BG               72.72         75.28a       58.82           60.15     74.46      62.20
  BN               63.63         75.28a       11.76           60.93     72.34      64.56
  DTab             63.63         75.28a       47.05           59.37     74.46      63.77
  J48              63.63         69.66        52.94           60.15     74.46a     62.20
  RIPPER           63.63         69.66a       52.94           60.15     51.06      54.33
  LR               72.72         73.03        52.94           64.06     76.59a     68.50
  LB               63.63         75.28a       47.05           61.71     74.46      67.71
  MLP              72.72         76.40a       52.94           60.93     74.46      64.56
  NB               72.72a        71.91        52.94           62.50     72.34      66.92
  NNge             18.18         69.66a       35.29           50.00     42.55      65.35
  RF               63.63         74.15a       52.94           64.84     65.95      58.26
  RBF              81.81a        69.66        35.29           59.37     74.46      66.92
  REP              27.27a        73.03a       70.58           54.68     61.70      65.35
  SVM               0.00         42.69         0.00           10.15     14.89      88.18a
  VP               36.36        100.00a        0.00           25.78     19.14       0.00
  ZeroR            36.36         89.88a       17.64           28.90     57.44      48.81
      again see that the largest number of best sensitivity values of the different techniques
      is achieved over the Android Contacts data set. The RBF technique (81.81%) and
     the VP technique (100%) achieved the best sensitivity values on Bluetooth and
     Contacts data sets, respectively. The REP technique and the RF technique gave
     best sensitivity values over the Calendar and Gallery2 data sets with sensitivity
     values of 70.58% and 64.84%, respectively. The LR technique and the AB technique
     both gave a sensitivity value of 76.59% on MMS data set. The best performing
     sensitivity value on Telephony data set was given by the SVM technique (88.18%).
  3. Validation results using specificity measure
      The specificity values of all the change prediction models on all the six Android
      data sets are depicted in Table 11.12. The Bluetooth and Contacts data sets showed
      the largest number of techniques with best-performing change prediction models
      when evaluated using the specificity measure. The SVM model gave the best
      specificity values on all the data sets except the Telephony data set, on which the
      VP technique gave the best specificity value.
  4. Validation results using precision measure
     Table 11.13 shows the precision values of all the change prediction models devel-
     oped on six Android data sets using 18 techniques. According to the table, the
      Telephony data set showed the best precision values for the majority of techniques.
     The best precision value was exhibited by the RBF, SVM, NNge, VP, SVM, and
     LR techniques on Android Bluetooth, Contacts, Calendar, Gallery2, MMS, and
     Telephony data sets, respectively (precision measures—Bluetooth: 45%, Contacts:
     71.69%, Calendar: 33.33%, Gallery2: 71.77%, MMS: 63.63%, Telephony: 76.99%).
  TABLE 11.12
  Validation Results Using Specificity Performance Measure
  Technique      Bluetooth     Contacts      Calendar        Gallery2   MMS      Telephony
  TABLE 11.13
  Validation Results Using Precision Performance Measure
  Technique      Bluetooth     Contacts     Calendar         Gallery2   MMS      Telephony
  TABLE 11.14
  Validation Results Using F-Measure Performance Measure
  Technique     Bluetooth     Contacts     Calendar        Gallery2   MMS     Telephony
  TABLE 11.15
  Validation Results Using AUC Performance Measure
  Technique                                  Bluetooth         Contacts           Calendar            Gallery2    MMS       Telephony
FIGURE 11.2
Top six performing techniques on Android Bluetooth data set (RBF, NB, LR, MLP, BG, and BN),
compared on accuracy, precision, F-measure, and AUC (%).
FIGURE 11.3
Top six performing techniques on Android Contacts data set (MLP, LB, BG, REP, DTab, and BN),
compared on accuracy, precision, F-measure, and AUC (%).
FIGURE 11.4
Top six performing techniques on Android Calendar data set (NNge, BG, NB, MLP, LR, and RF),
compared on accuracy, precision, F-measure, and AUC (%).
Figure 11.3 shows that MLP, LB, BG, REP, DTab, and BN techniques develop the top six
performing models on Android Contacts data set. The AUC value of the model developed
by the MLP technique was as high as 80.9%. The accuracy, precision, and F-measure values
for the MLP model were 75%, 73%, and 70%, respectively.
FIGURE 11.5
Top six performing techniques on Android Gallery2 data set (LB, AB, LR, ADT, BG, and BN),
compared on accuracy, precision, F-measure, and AUC (%).
FIGURE 11.6
Top six performing techniques on Android MMS data set (LB, LR, MLP, BG, RBF, and NB),
compared on accuracy, precision, F-measure, and AUC (%).
  According to Figure 11.4, the techniques developing the best performing change
prediction models on the Android Calendar data set were NNge, BG, NB, MLP, LR, and RF.
The NNge model showed the highest accuracy value of 77%, while all other models showed
accuracy values between 52% and 55%. The AUC values of all the models ranged from
51% to 60%.
  Figure 11.5 depicts the top six techniques that gave the best change prediction models
on the Android Gallery2 data set: LB, AB, LR, ADT, BG, and BN. However, as shown in
the figure, there was not much difference in the values of the different techniques. The
accuracy values ranged from 61% to 63%, the precision values from 50% to 52%, the
F-measure values from 55% to 57%, and the AUC values from 62% to 69%.
Similar results were shown by the Android MMS data set, as shown in Figure 11.6. The top
six best ranking techniques were LB, LR, MLP, BG, RBF, and NB. The accuracy values for the
MMS data set ranged from 72% to 76%, the precision values from 50% to 54%, the F-measure
values from 59% to 61%, and the AUC values from 79% to 81%.
FIGURE 11.7
Top six performing techniques on Android Telephony data set (LR, SVM, LB, NB, RBF, and MLP),
compared on accuracy, precision, F-measure, and AUC (%).
  Figure 11.7 shows the top six performing models on the Android Telephony data set. The
techniques used for developing these models were LR, SVM, LB, NB, RBF, and MLP. The
change prediction model developed by the LR technique gave the best results, with an
accuracy value of 69%, a precision value of 77%, an F-measure value of 72%, and an AUC
of 73%.
                     TABLE 11.16
                     Friedman Mean Ranks Using Accuracy Measure
                     Technique     Mean Rank   Technique   Mean Rank
                     TABLE 11.17
                     Friedman Mean Ranks Using Sensitivity Measure
                     Technique     Mean Rank   Technique   Mean Rank
                     TABLE 11.18
                     Friedman Mean Ranks Using Precision Measure
                     Technique     Mean Rank   Technique   Mean Rank
     techniques were LR and MLP. The worst performing techniques were VP and
     ZeroR. The p-value of 0.001 is much less than 0.05, so we reject the null hypoth-
     esis. The Friedman statistic was computed as 41.41. Hence, the precision values
     of change prediction models developed using different techniques differ signifi-
     cantly. Thus, the results of all the techniques are behaviorally different when eval-
     uated using precision measure.
  5. Testing hypothesis for F-measure
      To evaluate the null hypothesis, we performed the Friedman test using F-measure
      values. Table 11.19 presents the mean ranks obtained by all the techniques when
      we used F-measure for evaluating the various techniques. The LR technique and the
      MLP technique gave the best results with mean ranks of 3.50 and 4.42, respectively.
      The least effective techniques in terms of F-measure values were the ZeroR and VP
      techniques. The p-value for the Friedman test was <0.05, indicating acceptance
      of the alternate hypothesis. The Friedman statistic value was obtained as 53.661. The
     results show that change prediction models developed using all the techniques
     show significant differences when evaluated using F-measure values.
  6. Testing hypothesis for AUC measure
     Table 11.20 presents the mean ranks of all the techniques when the change predic-
     tion models developed by them on all the six data sets are evaluated using the
                     TABLE 11.19
                     Friedman Mean Ranks Using F-Measure
                     Technique     Mean Rank   Technique   Mean Rank
                     TABLE 11.20
                     Friedman Mean Ranks Using AUC Measure
                     Technique     Mean Rank   Technique   Mean Rank
                     LR              3.58       BN            9.67
                     MLP             3.75       DTab         10.00
                     NB              3.92       J48          11.58
                     BG              4.08       NNge         12.25
                     LB              4.67       RIPPER       13.42
                     RF              7.58       REP          13.42
                     ADT             8.33       SVM          13.67
                     RBF             8.67       VP           15.83
                     AB              8.75       ZeroR        17.83
     AUC measure. The LR and the MLP techniques again achieved the top two ranks,
     while the VP and the ZeroR technique gave the worst results. The Friedman sta-
     tistic value was calculated as 69.115  with a p-value much less than 0.05. Thus,
     we accept the alternate hypothesis, which indicates statistically significant dif-
     ferences in the performance of change prediction models developed by all the
     techniques.
The critical distance (CD) computed for the Nemenyi test is as follows (Demšar 2006):

                                      CD = qα √( k(k + 1) / (6n) )

Here, k corresponds to the number of techniques, which is 18 in this study, and n corresponds
to the number of data sets, which is six. The critical values (qα) are based on the studentized
range statistic divided by √2.
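A short worked computation of CD for this study's setting is sketched below; the qα value must be read from the Nemenyi critical-value table in Demšar (2006), so the number used here is only a placeholder, not the tabulated value for 18 techniques.

import math

k, n = 18, 6        # number of techniques and number of data sets in this study
q_alpha = 3.0       # placeholder critical value; use the tabulated q_alpha at alpha = 0.05

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
print("sqrt(k(k+1)/(6n)) =", round(math.sqrt(k * (k + 1) / (6 * n)), 3))  # ~3.082
print("CD =", round(cd, 3))
# Two techniques differ significantly if their Friedman mean ranks differ by more than CD.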
    For example, the RBF technique gave the best F-measure value for change pre-
    diction model on Android Bluetooth data set, and the MLP technique gave the
    best F-measure value for change prediction model on Android Contacts data set.
    However, the NNge technique, the LR technique, the MLP technique, and the
    SVM technique gave the best F-measure values for Android Calendar, Gallery2,
    MMS, and Telephony data sets, respectively. Moreover, Figures  11.2 through
    11.7 clearly show that the top-performing techniques on each data set differ if
    we take into account their cumulative performance. An analysis of Figures 11.2
    through 11.7 indicates that LR, MLP, and BG are high-performing techniques, as
    they rank among the top six techniques in five of the six data sets used in the study.
    Other good performing techniques include NB and LB, which rank among the top
    six techniques in four out of the six data sets of the study. Certain techniques
    (SVM, ADT, AB, RF, DTab, REP, NNge) gave good results in only one of the
    data sets. These techniques may be influenced by certain characteristics of a
    particular data set. However, we need to perform more such studies to actu-
    ally evaluate which type of techniques get influenced by the characteristics of a
    data set.
  RQ3: What is the comparative performance of different techniques when we take
    into account different performance measures?
    To answer this question, we formulated research hypotheses given in Section 11.4.4.1.
    The hypothesis testing was done with the help of Friedman test. The results of the
    study indicate that we reject the null hypothesis for all the selected performance
    measures. Thus, the results of change prediction models developed using different
    techniques were significantly different from each other when evaluated using accu-
    racy, sensitivity, specificity, precision, F-measure, and AUC performance measures.
    Tables 11.16 through 11.20 show Friedman mean ranks of various techniques using
    different performance measures. As can be seen, the LR technique gave the best
    results using all the performance measures except accuracy. Thus, we conclude that
    the LR technique is an effective technique for developing high-performing change
    prediction models. Also, the MLP technique is a good ML technique for developing
    models that predict change-prone classes as it achieves good Friedman ranks in all
    performance measures.
    The results show that the best performing ML technique for the development of
    change prediction models is MLP. As can be seen from Tables 11.16 through 11.20,
    the MLP technique received the best ranks after LR technique except in the case of
    accuracy measure. The results also show that although the model developed using
    the SVM technique predicted all outcome classes as not change prone (i.e., it has
    no predictive ability), the Friedman test ranks the SVM technique as the best in
    terms of accuracy. This is because of the imbalance in the values of the outcome
    variable in the data sets. In imbalanced data sets, there are fewer change-prone
    classes as compared to non-change-prone classes. For example,
    for Bluetooth and Calendar data sets the change-prone classes are only 19%, and
    for MMS data set, the change-prone classes are 30%. The SVM technique predicted
    all classes as not change prone, hence, the specificity and accuracy values were
    very high specifically for Bluetooth and Calendar data sets (more than 80%) and
    thus contributed toward high ranking of SVM in terms of accuracy. The accu-
    racy measure gives false results when the data is imbalanced and the technique
    classifies the classes into one single outcome category. Hence, this study does not
    base the interpretation of results on the accuracy measure.
FIGURE 11.8
Summary of Nemenyi test: number of pairs of techniques showing significantly different
results for each performance measure (accuracy, sensitivity, precision, F-measure, and AUC).
  RQ4: Which pairs of techniques are statistically significantly different  from each
    other for developing change prediction models?
    To answer this research question, we performed Nemenyi post hoc test among
    all possible pairs of techniques in the study. The test was performed using
    accuracy, sensitivity, precision, F-measure, and AUC values among 153  pairs
    of techniques. Figure  11.8  depicts the number of pairs of techniques that
    showed significantly different results using different performance measures.
    According to the figure, two pairs of techniques showed significant results
    using the accuracy measure, but no pair of techniques showed significant results
    using the sensitivity measure. On evaluation of the precision, F-measure, and AUC
    measures, four, eight, and ten pairs of techniques, respectively, were significantly
    different from each other. Only one pair of techniques, namely LR–ZeroR, showed
    significant Nemenyi results on four performance measures: accuracy, precision,
    F-measure, and AUC.
    nificant differences, it can be seen that ZeroR and VP techniques are statistically
    significantly different from a number of other techniques like LR, BG, LB, MLP,
    and NB.
  RQ5: What is the comparative performance of ML techniques with the statistical
    technique LR?
    The LR technique has been ranked higher in most of the performance measures,
    followed by MLP. The performance of the models developed using the ML techniques
    was comparable to that of the model developed using the LR technique.
  RQ6: Which ML technique gives the best performance for developing change predic-
    tion models?
similar studies to explore different programming languages and other project character-
istics to yield generalizable conclusions.
More studies in the future should be conducted to evaluate different statistical and ML
algorithms using other performance measures such as H-measure, G-measure, and so on.
Also, future studies can incorporate evolutionary computation techniques such as artifi-
cial immune systems and genetic algorithms for developing change prediction models.
Appendix
    TABLE 11A.1
    Nemenyi Test Results
    Pair No.   Algorithm Pair   Accuracy   Sensitivity   Precision   F-Measure      AUC
    1.            AB–ADT          1.67        1.25         1.75        1.58         0.42
    2.            AB–BG           2.33        0.83         3.33        2.33         4.67
    3.            AB–BN           1.42        1.92         1.17        1.08         0.92
    4.            AB–DTab         0.75        1.67         0.33        0.42         1.25
    5.            AB–J48          5.33        2.34         4.25        4.17         2.83
    6.            AB–RIPPER       5.00        4.34         3.83        4.25         4.67
    7.            AB–LR           4.17        3.16         4.42        4.25         5.17
    8.            AB–LB           3.33        1.08         3.25        2.75         4.08
    9.            AB–MLP          3.42        2.00         3.83        3.33         5.00
    10.           AB–NB           0.83        1.00         1.00        0.83         4.83
    11.           AB–Nnge         1.92        6.84         0.33        3.92         3.50
    12.           AB–RF           3.33        1.25         3.42        1.83         1.17
    13.           AB–RBF          0.08        1.17         0.25        0.42         0.08
    14.           AB–REP          2.83        2.75         1.92        2.92         4.67
    15.           AB–SVM          4.25        7.34         0.33        7.33         4.92
    16.           AB–VP           5.08        6.42         5.33        8.83         7.08
    17.           AB–ZeroR        6.58        5.34         8.25        8.25         9.08
    18.           ADT–BG          4.00        2.08         5.08        3.91         4.25
    19.           ADT–BN          3.09        0.67         0.58        0.50         1.34
    20.           ADT–DTab        0.92        0.42         1.42        1.16         1.67
    21.           ADT–J48         3.66        1.09         2.50        2.59         3.25
    22.           ADT–RIPPER      3.33        3.09         2.08        2.67         5.09
    23.           ADT–LR          5.84        4.41         6.17        5.83         4.75
    24.           ADT–LB          5.00        2.33         5.00        4.33         3.66
    25.           ADT–MLP         5.09        3.25         5.58        4.91         4.58
    26.           ADT–NB          2.50        2.25         2.75        2.41         4.41
    27.           ADT–Nnge        3.59        5.59         2.08        2.34         3.92
    28.           ADT–RF          1.66        0.00         1.67        0.25         0.75
    29.           ADT–RBF         1.59        0.08         2.00        1.16         0.34
    30.           ADT–REP         1.16        1.50         0.17        1.34         5.09
    31.           ADT–SVM         5.92        6.09         2.08        5.75         5.34
    32.           ADT–VP          3.41        5.17         3.58        7.25         7.50
    33.           ADT–ZeroR       4.91        4.09         6.50        6.67         9.50
    34.           BG–BN           0.91        2.75         4.50        3.41         5.59
    35.           BG–DTab         3.08        2.5          3.66        2.75         5.92
    36.           BG–J48          7.66        3.17         7.58        6.50         7.50
    37.           BG–RIPPER       7.33        5.17         7.16        6.58         9.34
    38.           BG–LR           1.84        2.33         1.09        1.92         0.50
    39.           BG–LB           1.00        0.25         0.08        0.42         0.59
    40.           BG–MLP          1.09        1.17         0.50        1.00         0.33
    41.           BG–NB           1.50        0.17         2.33        1.50         0.16
There are many statistical packages available to implement the concepts given in the previous
chapters. These statistical packages or tools can assist researchers and practitioners in
performing operations such as summarizing data, preselecting attributes, hypothesis testing,
model creation, and validation. Various statistical tools are available in the market, such as
SAS, R, Matrix Laboratory (MATLAB®*), SPSS, Stata, and Waikato Environment for
Knowledge Analysis (WEKA). An overview and comparison of these tools will help in
making a decision about the selection of an appropriate tool to assist the research process.
In this chapter, we provide an overview of five tools, namely, WEKA, Knowledge Extraction
based on Evolutionary Learning (KEEL), SPSS, MATLAB, and R, and summarize their
characteristics and available statistical procedures.
12.1 WEKA
The WEKA tool was developed at the University of Waikato in New Zealand
(http://www.cs.waikato.ac.nz/ml/weka/) and is distributed under the GNU public license.
The tool was developed in the Java language and runs on a number of platforms, be it Linux,
Macintosh, or Windows. It provides an easy-to-use interface for using a number of different
learning techniques. Moreover, it also provides various methods for preprocessing or
postprocessing data. Research has seen wide application of the WEKA tool for analyzing the
results of different techniques on different data sets. WEKA can be used for multiple purposes,
be it analyzing the results of a classification method on data, developing models to obtain
predictions on new data, or comparing the performances of several classification techniques.
12.2 KEEL
KEEL is a software tool that was developed using the Java language. The tool is open
source in nature and aids the user in the easy assessment of a number of evolutionary
and other soft computing techniques. The tool provides a framework for designing a
number of experiments for various data mining tasks such as classification, regression,
and pattern mining.
*   MATLAB® is a registered trademark of The MathWorks, Inc. For product information, please contact:
    The MathWorks, Inc.
    3 Apple Hill Drive
    Natick, MA 01760-2098 USA
    Tel: +1 508 647 7000
    Fax: +1 508 647 7001
    E-mail: info@mathworks.com
    Web: www.mathworks.com
   The KEEL software tool consists of an extensive number of features that can help researchers
and students to perform various data mining tasks in an easy and effective manner.
The tool provides a convenient and user-friendly interface to conduct and design various
experiments. It also incorporates a library of built-in data sets. KEEL specializes in the use
of evolutionary algorithms, which can be effectively used for model prediction, preprocessing
tasks, and certain postprocessing tasks. The tool also incorporates a number of data
preprocessing algorithms for various tasks such as noisy data filtering, selection of training sets,
discretization, and data transformation, among others. It also enables effective analysis and
comparison of results with the help of a statistical library. The experiments designed using
KEEL can be run either in an offline mode (on the same or another machine at a later time)
or in an online mode. The tool is designed for two specific types of users: researchers and
students. It facilitates researchers by easy automation of experiments and effective result
analysis using the statistical library. It is a useful learning tool for students, as a student can
have a real-time view of a technique's evolving process with visual feedback (Alcalá et al. 2011).
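KEEL experiments are configured through its graphical experiment designer rather than through code. Purely as a generic illustration of the evolutionary loop (selection, crossover, and mutation) that such evolutionary algorithms are built around, the following minimal Python sketch optimizes a toy fitness function (the number of 1-bits in a bit string); it is not KEEL code, and all constants are arbitrary.

    # Generic sketch of an evolutionary (genetic) algorithm, not KEEL code:
    # maximize a toy fitness function (count of 1-bits) over a bit string.
    import random

    random.seed(1)
    GENES, POP, GENERATIONS, MUTATION = 20, 30, 40, 0.02

    def fitness(ind):                       # toy objective: number of 1-bits
        return sum(ind)

    def tournament(population):             # keep the fitter of two random individuals
        a, b = random.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
    for _ in range(GENERATIONS):
        new_pop = []
        while len(new_pop) < POP:
            p1, p2 = tournament(pop), tournament(pop)
            cut = random.randrange(1, GENES)                 # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < MUTATION else g
                     for g in child]                         # bit-flip mutation
            new_pop.append(child)
        pop = new_pop

    best = max(pop, key=fitness)
    print("best fitness:", fitness(best), "out of", GENES)

In KEEL the same ingredients (population, fitness, selection, and variation operators) are assembled graphically, and the fitness function is typically the predictive performance of a model or the quality of a preprocessing step.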
12.3 SPSS
SPSS Statistics is a software package that is used for statistical analysis. It was acquired
by IBM in 2009, and the current versions (2014) are officially named IBM SPSS Statistics.
The software name originally stood for Statistical Package for the Social Sciences, which
reflects its original market.
  SPSS is one of the most powerful tools that can be used for carrying out almost any
type of data analysis, whether in the field of the social sciences, the natural sciences, or
the world of business and management. The tool is widely used for research and the
interpretation of data. It performs four major functions: it creates and maintains a data set,
analyzes the data, produces results after the analysis, and graphs them. This tutorial
focuses on the main functions and utilities that can be used by a researcher for
performing various empirical studies.
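SPSS is normally operated through its menus or its own syntax language. Purely as an analogous sketch of the four functions listed above, the following Python fragment (assuming the pandas and matplotlib libraries, with made-up values) creates a small data set, analyzes it, prints the results, and graphs them; it is not SPSS syntax.

    # Analogous sketch only (Python/pandas/matplotlib, not SPSS syntax):
    # create a data set, analyze it, produce results, and graph them.
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.DataFrame({"loc": [926, 36, 21, 4, 1],       # hypothetical values
                         "faults": [1, 0, 0, 0, 0]})
    print(data.describe())                                  # descriptive statistics
    data.plot.scatter(x="loc", y="faults")                  # graph the data
    plt.savefig("loc_vs_faults.png")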
12.4 MATLAB®
MATLAB is a high-performance interactive software system that integrates computation
and visualization for technical computations and graphics. MATLAB was primarily
developed by Cleve Moler in the 1970s. The tool is derived from two FORTRAN subroutine
libraries, namely, EISPACK and LINPACK: EISPACK solves eigenvalue problems and
LINPACK solves systems of linear equations. The package was rewritten in the 1980s in C.
This rewritten package had larger functionality and a number of plotting routines. To
commercialize MATLAB and develop it further, The MathWorks, Inc. was founded in 1984.
  MATLAB is specially designed for matrix computations: solving linear equations, factoring
matrices, and computing eigenvalues and eigenvectors. In addition, it has sophisticated
graphical features that are extendable. MATLAB also provides a good programming
language environment, as it offers many facilities such as editing and debugging tools and
data structures, and it supports object-oriented paradigms.
  MATLAB provides a number of built-in routines that aid extensive computations. Also,
the results can be immediately visualized with the help of simple graphical commands.
A MATLAB toolbox consists of a collection of functions for a specific application area.
There are a number of toolboxes for various tasks such as simulation, symbolic computation,
signal processing, and many other related tasks in the fields of science and engineering.
These factors make MATLAB an excellent tool, and it is used at most universities and in
industry worldwide for teaching and research. However, MATLAB has some weaknesses:
because it is designed for scientific computing, its commands are specific to that usage,
and it is not as suitable for other applications as a general-purpose programming language
such as C or C++. It is also an interpreted language and is therefore slower than compiled
languages. Mathematica, Scilab, and GNU Octave are some of the competitors of MATLAB.
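The matrix computations mentioned above are easiest to see with a concrete fragment. The following minimal sketch uses Python with NumPy (an assumption chosen for illustration; it is not MATLAB code, although the corresponding MATLAB commands are similarly short) to solve a small hypothetical linear system and to compute eigenvalues and eigenvectors.

    # Illustrative sketch only (Python/NumPy, not MATLAB): the kinds of matrix
    # computations described above, on a small hypothetical matrix.
    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])
    b = np.array([1.0, 2.0])

    x = np.linalg.solve(A, b)       # solve the linear system A x = b
    w, v = np.linalg.eig(A)         # eigenvalues w and eigenvectors v of A

    print("solution x:", x)
    print("eigenvalues:", w)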
12.5 R
R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland,
New Zealand. It is freely available under the GNU General Public License and can be used
with various operating systems.
  R is a well-developed, simple, and effective programming language that is extensively
used by statisticians for statistical computing and data analysis. In addition, R includes
facilities for data calculation and manipulation, various operators for working with arrays
(or matrices), tools and graphical facilities for data analysis, input and output facilities,
and so on.
TABLE 12.1
Comparison of Tools
Tool   | Operating System | License     | Interface   | Open Source | Language  | Graphics  | Availability of Help | Link                                     | Speciality
WEKA   | Mac/Windows      | GNU GPL     | Syntax/Menu | Yes         | Java      | Excellent | Good                 | www.cs.waikato.ac.nz/~ml/weka            | Used for machine learning techniques
KEEL   | Mac/Windows      | GNU GPL     | Menu        | Yes         | Java      | Moderate  | Moderate             | www.keel.es                              | Used for evolutionary algorithms
SPSS   | Mac/Windows      | Proprietary | Syntax/Menu | No          | Java      | Very good | Good                 | www.ibm.com/software/analytics/spss/     | Used for multivariate analysis and statistical testing
MATLAB | Mac/Windows      | Proprietary | Syntax/Menu | No          | C++/Java  | Good      | Very good            | http://in.mathworks.com/products/matlab/ | Best for developing new mathematical techniques; used for image and signal processing
R      | Mac/Windows      | GNU GPL     | Syntax      | Yes         | Fortran/C | NA        | Average              | www.r-project.org                        | Extensive library support
          TABLE 12.2
          Parameter Comparison
          Parameters                                               WEKA   MATLAB   KEEL   SPSS   R
          Correlation                                               Y       Y       N      Y     Y
          Normality tests                                           N       Y       N      Y     Y
          Descriptive Statistics
          Minimum                                                   Y       Y       Y      Y     Y
          Maximum                                                   Y       Y       Y      Y     Y
          Variance                                                  N       Y       Y      Y     Y
          Standard deviation                                        Y       Y       N      Y     Y
          Skewness                                                  N       Y       N      Y     Y
          Kurtosis                                                  N       Y       N      Y     Y
          Mean                                                      Y       Y       Y      Y     Y
          Median                                                    N       Y       N      Y     Y
          Mode                                                      N       Y       N      Y     Y
          Quartiles                                                 N       Y       N      Y     Y
          Feature Selection
          Correlation-based feature selection                       Y       N       N      N     Y
          Principal component analysis                              Y       Y       Y      Y     Y
          Evolutionary Algorithms
          Genetic algorithm                                         Y       Y       Y      N     Y
          Genetic programming                                       Y       Y       N      N     Y
          Ant colony optimization                                   N       Y       Y      N     Y
          Ant miner                                                 N       N       Y      N     N
          Multi-objective particle swarm optimization               Y       Y       N      N     Y
          Artificial immune system                                  Y       N       N      N     N
          Particle swarm optimization linear discriminant analysis  N       N       Y      N     N
          Constricted particle swarm optimization                   N       N       Y      N     N
          Hierarchical decision rules                               N       N       Y      N     N
          Decision trees with genetic algorithms                    N       N       Y      N     N
          Neural net evolutionary programming                       N       N       Y      N     N
          Genetic algorithms with neural networks                   N       N       Y      N     N
          Genetic fuzzy system logitboost                           N       N       Y      N     N
          Cross-validation                                          Y       Y       Y      Y     Y
          ROC curve                                                 Y       Y       N      Y     Y
Further Readings
The basic use of the WEKA tool is described in:
  M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten,
      “The WEKA data mining software: An update,” ACM SIGKDD Explorations
      Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
  I. H. Witten, and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques,
      Morgan Kaufmann, Boston, MA, 2005.
The classic applications and working of the SPSS tool are described in:
  S. J. Coakes, and L. Steed, SPSS: Analysis without Anguish Using SPSS Version 14.0 for
      Windows, John Wiley & Sons, Chichester, 2009.
  D. George, SPSS for Windows Step by Step: A Simple Study Guide and Reference,
      17.0 Update, 10/e, Pearson Education, New Delhi, India, 2003.
  S. B. Green, N. J. Salkind, and T. M. Jones, Using SPSS for Windows: Analyzing and
      Understanding Data, Prentice Hall, Upper Saddle River, NJ, 1996.
  S. Landau, and B. Everitt, A Handbook of Statistical Analyses Using SPSS, Chapman &
      Hall, Boca Raton, FL, vol. 1, 2004.
  M. P. Marchant, N. M. Smith, and K. H. Stirling, SPSS as a Library Research Tool, School
      of Library and Information Sciences, Brigham Young University, Provo, UT, 1977.
  M. J. Norušis, SPSS Advanced Statistics: Student Guide, SPSS, Chicago, IL, 1990.
  M. J. Norusis, SPSS 15.0 Guide to Data Analysis, Prentice Hall, Englewood Cliffs, NJ, 2007.
  J. Pallant, SPSS Survival Manual, McGraw-Hill, Maidenhead, 2013.
  S. Sarantakos, A Toolkit for Quantitative Data Analysis: Using SPSS, Palgrave Macmillan,
      New York, 2007.
  http://math.ucsd.edu/~bdriver/21d-s99/matlab-primer.html.
  http://www.mathworks.com/products/demos/.
  J. M. Crawley, Statistics: An Introduction Using R, John Wiley & Sons, England, 2014.
Appendix: Statistical Tables
This appendix contains the statistical tables that are required for the examples in Chapter 6.
We have replicated only the portions of these tables that are used in Chapter 6; to find
detailed tables, readers can refer to any statistical book, such as Anderson et al. (2002).
A short sketch showing how such values can also be computed programmatically is given
after the list. The various statistical tables included in this appendix are as follows:
  •   t-Test
  •   Chi-square test
  •   Wilcoxon–Mann–Whitney test
  •   Area under the normal distribution
  •   F-Test table at 0.05 significance level
  •   Critical values for two-tailed Nemenyi test at 0.05 significance level
  •   Critical values for two-tailed Bonferroni test at 0.05 significance level
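As noted above, these values can also be computed directly rather than looked up. The sketch below uses Python with SciPy (an assumed external library; the tables remain the reference) and arbitrary, illustrative degrees of freedom.

    # Illustrative sketch only (Python/SciPy): computing tabulated critical
    # values and areas programmatically; the degrees of freedom are arbitrary
    # examples, not values prescribed by this appendix.
    from scipy import stats

    print(stats.t.ppf(0.95, df=10))          # one-tailed t critical value, alpha = 0.05, df = 10
    print(stats.chi2.ppf(0.95, df=5))        # chi-square critical value, alpha = 0.05, df = 5
    print(stats.norm.cdf(-1.96))             # area under the normal curve to the left of z = -1.96
    print(stats.f.ppf(0.95, dfn=3, dfd=20))  # F critical value, alpha = 0.05, (3, 20) df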
        TABLE A.1
        t-Test Table
        Level of significance for one-tailed test
                         0.10             0.05       0.02          0.01            0.005
      TABLE A.2
      Chi-Square Table
      Df       0.99            0.95      0.50    0.10        0.05         0.02      0.01
      TABLE A.3
       Wilcoxon–Mann–Whitney Table for n2 = 5
       n1                1               2               3             4                5
TABLE A.4
Area Under the Normal Distribution
Z       0.00      0.01     0.02      0.03       0.04      0.05      0.06      0.07     0.08        0.09
−3.9   0.00005   0.00005   0.00004   0.00004   0.00004   0.00004   0.00004   0.00004   0.00003   0.00003
−3.8   0.00007   0.00007   0.00007   0.00006   0.00006   0.00006   0.00006   0.00005   0.00005   0.00005
−3.7   0.00011   0.00010   0.00010   0.00010   0.00009   0.00009   0.00008   0.00008   0.00008   0.00008
−3.6   0.00016   0.00015   0.00015   0.00014   0.00014   0.00013   0.00013   0.00012   0.00012   0.00011
−3.5   0.00023   0.00022   0.00022   0.00021   0.00020   0.00019   0.00019   0.00018   0.00017   0.00017
−3.4   0.00034   0.00032   0.00031   0.00030   0.00029   0.00028   0.00027   0.00026   0.00025   0.00024
−3.3   0.00048   0.00047   0.00045   0.00043   0.00042   0.00040   0.00039   0.00038   0.00036   0.00035
−3.2   0.00069   0.00066   0.00064   0.00062   0.00060   0.00058   0.00056   0.00054   0.00052   0.00050
−3.1   0.00097   0.00094   0.00090   0.00087   0.00084   0.00082   0.00079   0.00076   0.00074   0.00071
−3.0   0.00135   0.00131   0.00126   0.00122   0.00118   0.00114   0.00111   0.00107   0.00104   0.00100
−2.9   0.00187   0.00181   0.00175   0.00169   0.00164   0.00159   0.00154   0.00149   0.00144   0.00139
−2.8   0.00256   0.00248   0.00240   0.00233   0.00226   0.00219   0.00212   0.00205   0.00199   0.00193
−2.7   0.00347   0.00336   0.00326   0.00317   0.00307   0.00298   0.00289   0.00280   0.00272   0.00264
−2.6   0.00466   0.00453   0.00440   0.00427   0.00415   0.00402   0.00391   0.00379   0.00368   0.00357
−2.5   0.00621   0.00604   0.00587   0.00570   0.00554   0.00539   0.00523   0.00508   0.00494   0.00480
−2.4   0.00820   0.00798   0.00776   0.00755   0.00734   0.00714   0.00695   0.00676   0.00657   0.00639
−2.3   0.01072   0.01044   0.01017   0.00990   0.00964   0.00939   0.00914   0.00889   0.00866   0.00842
−2.2   0.01390   0.01355   0.01321   0.01287   0.01255   0.01222   0.01191   0.01160   0.01130   0.01101
−2.1   0.01786   0.01743   0.01700   0.01659   0.01618   0.01578   0.01539   0.01500   0.01463   0.01426
−2.0   0.02275   0.02222   0.02169   0.02118   0.02068   0.02018   0.01970   0.01923   0.01876   0.01831
−1.9   0.02872   0.02807   0.02743   0.02680   0.02619   0.02559   0.02500   0.02442   0.02385   0.02330
−1.8   0.03593   0.03515   0.03438   0.03362   0.03288   0.03216   0.03144   0.03074   0.03005   0.02938
−1.7   0.04457   0.04363   0.04272   0.04182   0.04093   0.04006   0.03920   0.03836   0.03754   0.03673
−1.6   0.05480   0.05370   0.05262   0.05155   0.05050   0.04947   0.04846   0.04746   0.04648   0.04551
−1.5   0.06681   0.06552   0.06426   0.06301   0.06178   0.06057   0.05938   0.05821   0.05705   0.05592
−1.4   0.08076   0.07927   0.07780   0.07636   0.07493   0.07353   0.07215   0.07078   0.06944   0.06811
−1.3   0.09680   0.09510   0.09342   0.09176   0.09012   0.08851   0.08691   0.08534   0.08379   0.08226
−1.2   0.11507   0.11314   0.11123   0.10935   0.10749   0.10565   0.10383   0.10204   0.10027   0.09853
−1.1   0.13567   0.13350   0.13136   0.12924   0.12714   0.12507   0.12302   0.12100   0.11900   0.11702
−1.0   0.15866   0.15625   0.15386   0.15151   0.14917   0.14686   0.14457   0.14231   0.14007   0.13786
−0.9   0.18406   0.18141   0.17879   0.17619   0.17361   0.17106   0.16853   0.16602   0.16354   0.16109
−0.8   0.21186   0.20897   0.20611   0.20327   0.20045   0.19766   0.19489   0.19215   0.18943   0.18673
−0.7   0.24196   0.23885   0.23576   0.23270   0.22965   0.22663   0.22363   0.22065   0.21770   0.21476
−0.6   0.27425   0.27093   0.26763   0.26435   0.26109   0.25785   0.25463   0.25143   0.24825   0.24510
−0.5   0.30854   0.30503   0.30153   0.29806   0.29460   0.29116   0.28774   0.28434   0.28096   0.27760
−0.4   0.34458   0.34090   0.33724   0.33360   0.32997   0.32636   0.32276   0.31918   0.31561   0.31207
−0.3   0.38209   0.37828   0.37448   0.37070   0.36693   0.36317   0.35942   0.35569   0.35197   0.34827
−0.2   0.42074   0.41683   0.41294   0.40905   0.40517   0.40129   0.39743   0.39358   0.38974   0.38591
−0.1   0.46017   0.45620   0.45224   0.44828   0.44433   0.44038   0.43644   0.43251   0.42858   0.42465
−0.0   0.50000   0.49601   0.49202   0.48803   0.48405   0.48006   0.47608   0.47210   0.46812   0.46414
 0.0   0.50000   0.50399   0.50798   0.51197   0.51595   0.51994   0.52392   0.52790   0.53188   0.53586
                                                                                              (Continued)
   TABLE A.5
   F-Test Table at 0.05 Significance Level
   ν1
   ν2          1           2             3             4             5              6            7            8          9
   TABLE A.6
   Critical Values for Two-Tailed Nemenyi Test at 0.05 Significance Level
   Number of Subjects             2             3             4             5           …         ...         9          10
   q0.10                       1.645         2.052         2.291          2.459          .        .      2.855         2.920
   q0.05                       1.960         2.344         2.569          2.728          .        .      3.102         3.164
   q0.01                       2.576         2.913         3.113          3.255          .        .      3.590         3.646
   TABLE A.7
   Critical Values for Two-Tailed Bonferroni Test at 0.05 Significance Level
   Number of Subjects             2             3             4             5           …         ...         9          10
   TABLE A.8
   Data Set Example
   WMC             DIT         NOC                  CBO             RFC                 LCOM            LOC            Fault
   28               1             0                  32              82                  374            926              1
    6               1             2                   3               7                    3             36              0
    4               2             0                   5               6                    4             21              0
    4               1             0                   9               4                    6              4              0
    1               1             0                   8               1                    0              1              0
                                                                                                                  (Continued)
A. Abran and J. Moore, “Guide to the software engineering body of knowledge,” IEEE Computer
      Society, Piscataway, NJ, 2004.
ACM, Association for Computing Machinery, “ACM code of ethics and professional conduct,” 2015, http://www
      .acm.org/constitution/code.html.
W. Afzal, “Metrics in software test planning and test design processes,” PhD Dissertation, School of
      Engineering, Blekinge Institute of Technology, Karlskrona, Sweden, 2007.
W. Afzal, “Using faults-slip-through metric as a predictor of fault proneness,” In Proceedings of the
      17th Asia Pacific Software Engineering Conference, pp. 414–422, 2010.
W. Afzal, and R. Torkar, “Lessons from applying experimentation in software engineering predic-
      tion systems,” In Proceedings of the 2nd International. WS on Software Productivity Analysis and
      Cost Estimation, co-located with APSEC, vol. 8, 2008.
W. Afzal, R. Torkar, and R. Feldt, “A systematic literature review of search-based software testing for
      non-functional system properties,” Information and Software Technology, vol. 51, no. 6, 957–976,
      2009.
K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Empirical analysis for investigating the effect
      of object-oriented metrics on fault proneness: A replicated study,” Software Process: Improvement
      and Practice, vol. 16, no. 1, pp. 39–62, 2009.
K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Empirical study of object-oriented metrics,”
      Journal of Object Technology, vol. 5, no. 8, pp. 149–173, 2006a.
K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Investigating the effect of coupling metrics on
      fault proneness in object-oriented systems,” Software Quality Professional, vol. 8, no. 4, pp. 4–16,
      2006b.
K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Investigating the effect of design metrics on fault
      proneness in object-oriented systems,” Journal of Object Technology, vol.  6, no.  10, pp.  127–141,
      2007.
K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Software reuse metrics for object-oriented
      systems,” In Proceedings of the 3rd ACIS International Conference on Software Engineering Research,
      Management & Applications, Central Michigan University, Mount Pleasant, MI, pp. 48–55, 2005.
J. Al Dallal, “Fault prediction and the discriminative powers of connectivity-based object-oriented
      class cohesion metrics,” Information and Software Technology, vol. 54, no. 4, pp. 396–416, 2012a.
J. Al Dallal, “The impact of accounting for special methods in the measurement of object-oriented
      class cohesion on refactoring and fault prediction activities,” Journal of Systems and Software,
      vol. 85, no. 5, pp. 1042–1057, 2012b.
J. Al Dallal, “Improving the applicability of object-oriented class cohesion metrics,” Information and
      Software Technology, vol. 53, no. 9, pp. 914–928, 2011.
J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “Keel data-mining
      software tool: Data set repository, integration of algorithms and experimental analysis frame-
      work,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 11, pp. 255–287, 2011.
M. Alshayeb, and W. Li, “An empirical investigation of object-oriented metrics in two different itera-
      tive processes,” IEEE Transactions on Software Engineering, vol. 29, no. 11, pp. 1043–1049, 2003.
V. Ambriola, L. Bendix, and P. Ciancarini, “The evolution of configuration management and version
      control,” Software Engineering Journal, vol. 5, no. 6, pp. 303–310, 1990.
M. D. Ambros, M. Lanza, and R. Robbes, “Evaluating defect prediction approaches: A benchmark
      and an extensive comparison,” Empirical Software Engineering, vol. 17, no. 4–5, pp. 531–577, 2012.
M. D. Ambros, M. Lanza, and R. Robbes, “An extensive comparison of bug prediction approaches,” In 7th
      IEEE Working Conference on Mining Software Repositories, Cape Town, South Africa, pp. 31–41, 2010.
M. D. Ambros, M. Lanza, and R. Robbes, “On the relationship between change coupling and soft-
      ware defects,” In 16th Working Conference on Reverse Engineering, Lille, France, pp. 135–144, 2009.
D. Anderson, D. Sweeney, T. Williams, J. Camm, and J. Cochran, Statistics for Business & Economics,
       Cengage Learning, Mason, OH, 2002.
J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?,” In 28th International Conference on
       Software Engineering, Shanghai, China, pp. 361–370, 2006.
A. Arcuri, and G. Fraser, “Parameter tuning or default values? An empirical investigation in search-
       based software engineering,” Empirical Software Engineering, vol. 18, no. 3, pp. 594–623, 2013.
E. Arisholm, and L. C. Briand, “Predicting fault-prone components in a java legacy system,”
       In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering,
       New York: ACM pp. 8–17, 2006.
E. Arisholm, L. C. Briand, and A. Foyen, “Dynamic coupling measures for object-oriented soft-
       ware,” IEEE Transactions on Software Engineering, vol. 30, no. 8, pp. 491–506, 2004.
E. Arisholm, L. C. Briand, and E. B. Johanessen, “A systematic and comprehensive investigation of
       methods to build and evaluate fault prediction models,” Journal of Systems and Software, vol. 83,
       no. 1, pp. 2–17, 2010.
D. Ary, L. C. Jacobs, and A. Razavieh, “Introduction to Research in Education,” New York: Holt
       Rinehart & Winston, vol. 1, pp. 9–72, 1972.
D. Azar, and J. Vybihal, “An ant colony optimization algorithm to improve software quality pre-
       diction models: Case of class stability,” Information and Software Technology, vol.  53, no.  4,
       pp. 388–393, 2011.
E. R. Babbie, Survey Research Methods, Wadsworth, OH: Belmont, 1990.
A. W. Babich, Software Configuration Management: Coordination for Team Productivity, Addison-Wesley,
       Boston, MA, 1986.
T. Ball, J. M. Kim, A. A. Porter, and H. P. Siy, “If your version control system could talk,” In ICSE Workshop
       on Process Modelling and Empirical Studies of Software Engineering, vol. 11, Boston, MA, 1997.
R. K. Bandi, V. K. Vaishnavi, and D. E. Turk, “Predicting maintenance performance using object-
       oriented design complexity metrics,” IEEE Transactions on Software Engineering, vol. 29, no. 1,
       pp. 77–87, 2003.
J. Bansiya, and C. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE
       Transactions on Software Engineering, vol. 28, no. 1, pp. 4–17, 2002.
G. M. Barnes, and B. R. Swim, “Inheriting software metrics,” Journal of Object Oriented Programming,
       vol. 6, no. 7, 27–34, 1993.
V. Barnett, and T. Lewis, Outliers in Statistical Data, New York: Wiley, 1994.
M. O. Barros, and A. C. D. Neto, “Threats to validity in search-based software engineering empirical
       studies,” Technical Report TR 0006/2011, UNIRIO—Universidade Federal do Estado do, Rio de
       Janeiro, Brazil, 2011.
V. R. Basili, L. C. Briand, and W. L. Melo, “A validation of object-oriented design metrics as quality
       indicators,” IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.
 V. R. Basili, and D. M. Weiss, “A methodology for collecting valid software engineering data,” IEEE
       Transactions on Software Engineering, vol. 6, pp. 728–738, 1984.
V. R. Basili and R. Reiter, “Evaluating automatable measures of software models,” In IEEE Workshop on
       Quantitative Software Models, New York, pp. 107–116, 1979.
V. R. Basili, R. W. Selby, and D. H. Hutchens, “Experimentation in software engineering,” IEEE
       Transactions on Software Engineering, vol. 12, no. 7, pp. 733–743, 1986.
U. Becker-Kornstaedt, “Descriptive software process modeling how to deal with sensitive process
       information,” Empirical Software Engineering, vol. 6, no. 4, pp. 353–367, 2001.
J. Bell, Doing Your Research Project: A Guide for First-Time Researchers in Education, Health and Social
       Science, Maidenhead: Open University Press, 2005.
D. A. Belsley, E. Kuh, and R. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of
       Collinearity, New York: Wiley, 1980.
R. Bender, “Quantitative risk assessment in epidemiological studies investigating threshold effects,”
       Biometrical Journal, vol. 41, no. 3, pp. 305–319, 1999.
S. Benlarbi, and W. L. Melo, “Polymorphism measures for early risk prediction,” In Proceedings of the
       21st International Conference on Software Engineering, Los Angeles, CA, pp. 335–344, 1999.
S. Benlarbi, K. El Emam, N. Goel, and S. Rai, “Thresholds for object-oriented measures,” In Proceedings
      of 11th International Symposium on Software Reliability Engineering, San Jose, CA, pp. 24–38, 2000.
E. H. Bersoff, V. D. Henderson, and S. G. Siegel, “Software configuration management,” ACM
      SIGSOFT Software Engineering Notes, vol. 3, no. 5, pp. 9–17, 1978.
N. Bevan, “Measuring usability as quality of use,” Software Quality Journal, vol. 4, no. 2, pp. 115–150,
      1995.
J. Bieman, and B. Kang, “Cohesion and reuse in an object oriented system,” In Proceedings of the ACM
      Symposium Software Reusability, Seattle, WA: ACM, pp. 259–262, 1995.
J. Bieman, G. Straw, H. Wang, P. W. Munger, and R. T. Alexander, “Design patterns and change
      proneness: An examination of five evolving systems,” In Proceedings of the 9th International
      Software Metrics Symposium, Sydney, Australia, pp. 40–49, 2003.
A. Binkley, and S. Schach, “Validation of the coupling dependency metric as a risk predictor,”
      In Proceedings of the International Conference on Software Engineering, Kyoto, Japan, pp. 452–455, 1998.
A. Birk, T. Dingsøyr, and T. Stålhane, “Postmortem: Never leave a project without it,” IEEE Software,
      vol. 19, pp. 43–45, 2002.
L. D. Bloomberg, and M. Volpe, Completing Your Qualitative Dissertation: A Roadmap from Beginning to
      End, London: Sage Publications, 2012.
G. Booch, Object-Oriented Analysis and Design with Applications, 2nd edition, Benjamin/Cummings,
      San Francisco, CA, 1994.
M. Bramer, Principles of Data Mining, Springer, London, 2007.
P. Brereton, B. Kitchenham, D. Budgen, M. Turner, and M. Khalid, “Lessons from applying the sys-
      tematic literature review process within the software engineering domain,” Journal of Systems
      and Software, vol. 80, no. 4, pp. 571–583, 2007.
L. C. Briand, J. W. Daly, and J. K. Wust, “A unified framework for cohesion measurement in object-
      oriented systems,” Empirical Software Engineering, vol. 3, no. 1, pp. 65–117, 1998.
L. C. Briand, J. W. Daly, and J. K. Wust, “A unified framework for coupling measurement in
      object-oriented systems,” IEEE Transactions on Software Engineering, vol. 25, no. 1, pp. 91–121,
      1999a.
L. C. Briand, S. Morasca, and V. R. Basili, “Defining and validating measures for object-based high-level
      design,” IEEE Transactions on Software Engineering, vol. 25, no. 5, pp. 722–743, 1999b.
L. C. Briand, P. Devanbu, and W. Melo, “An investigation into coupling measures for C++,”
      In Proceedings of the ICSE, Boston, MA: ACM, pp. 412–421, 1997.
L. C. Briand, and J. W. Wüst. “Empirical studies of quality models in object-oriented systems,”
      Advances in Computers, vol. 56, pp. 97–166, 2002.
L. C. Briand, J. W. Wüst, J. W. Daly, and D. V. Porter, “Exploring the relationships between design
      measures and software quality in object-oriented systems,” Journal of Systems and Software,
      vol. 51, no. 3, pp. 245–273, 2000.
L. C. Briand, J. W. Wust, and H. Lounis, “Replicated case studies for investigating quality factors in
      object-oriented designs,” Empirical Software Engineering, vol. 6, no. 1, pp. 11–58, 2001.
L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Tree, CRC Press,
      Boca Raton, FL, 1984.
F. B. Abreu, and W. Melo, “Evaluating the impact of object-oriented design on software quality,” In
      Proceedings of the 3rd International Symposium on Software Metrics, Berlin, Germany, pp. 90–99,
      1996.
V. R. Caldiera, G. Caldiera, and H. D. Rombach, “The goal question metric approach,” Encyclopedia
      of Software Engineering, vol. 2, pp. 528–532, 1994.
D. T. Campbell, and J. C. Stanley, “Experimental and quasi-experimental designs for research,”
      Boston, MA: Houghton Miffin Company, 1963.
G. Canfora and L. Cerulo, “How software repositories can help in resolving a new change
      request,” In Proceedings of Workshop on Empirical Studies in Reverse Engineering, Paolo Tonella,
      Italy, 2005.
L. Di Geronimo, F. Ferrucci, A. Murolo, and F. Sarro, “A parallel genetic algorithm based on hadoop
      mapreduce for the automatic generation of junit test suites,” In Proceedings of 5th International
      Conference on Software Testing, Verification and Validation, Montreal, Canada, pp. 785–793, 2012.
S. Di Martino, F. Ferrucci, C. Gravino, and F. Sarro, “A genetic algorithm to configure support vec-
      tor machines for predicting fault prone components,” In Proceedings of the 12th  International
      Conference on Product-Focused Software Process Improvement, Limerick, Ireland, pp. 247–261, 2011.
K. Dickersin, “The existence of publication bias and risk factors for its occurrence,” Journal of the
      American Medical Association, vol. 263, no. 10, pp. 1385–1395, 1990.
W. Ding, P. Liang, A. Tang, and H. V. Vilet, “Knowledge-based approaches in software documentation:
      A systematic literature review,” Information and Software Technology, vol. 56, no. 6, pp. 545–567, 2014.
E. Duman, “Comparison of decision tree algorithms in identifying bank customers who are likely
      to buy credit cards,” In Proceedings of the 7th International Baltic Conference on Databases and
      Information Systems, Kaunas, Lithuania, July 3–6, 2006.
R. P. Duran, M. A. Eisenhart, F. D. Erickson, C. A. Grant, J. L. Green, L. V. Hedges, F. J. Levine, P. A.
      Moss, J. W. Pellegrino, and B. L. Schneider, “Standards for reporting on empirical social sci-
      ence research in AERA publications american educational research association,” Educational
      Researcher, vol. 35, no. 6, pp. 33–40, 2006.
B. Eftekhar, K. Mohammad, H. Ardebili, M. Ghodsi, and E. Ketabchi, “Comparison of artificial neural
      network and logistic regression models for prediction of mortality in head trauma based on ini-
      tial clinical data,” BMC Medical Informatics and Decision Making, vol. 5, no. 3, pp. 1–8, 2005.
K. El Emam, “Ethics and open source,” Empirical Software Engineering, vol. 6, no. 4, pp. 291–292,
      2001.
K. El Emam, S. Benlarbi, N. Goel, and S. N. Rai, “The confounding effect of class size on the validity
      of object-oriented metrics,” IEEE Transactions on Software Engineering, vol. 27, no. 7, pp. 630–650,
      2001a.
K. El Emam, S. Benlarbi, N. Goel, and S. Rai, “The Optimal Class Size for Object-Oriented Software:
      A Replicated Case Study,” Technical Report ERB-1074, National Research Council of Canada,
      Canada, 2000a.
K. El Emam, S. Benlarbi, N. Goel, and S. Rai, “A validation of object- oriented metrics,” Technical
      Report ERB-1063, National Research Council of Canada, Canada, 1999.
K. El Emam, N. Goel, and S. Rai, “Thresholds for object oriented measures,” In Proceedings of the
      11th International Symposium on Software Reliability Engineering, San Jose, CA, pp. 24–38, 2000b.
K. El Emam, W. Melo, and J. C. Machado, “The prediction of faulty classes using object-oriented
      design metrics,” Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001b.
A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing, Natural Computing Series, New
      York: Springer-Verlag, 2003.
M. O. Elish, and M. A. Al-Khiaty, “A suite for quantifying historical changes to predict future
      change-prone classes in object-oriented software,” Journal of Software: Evolution and Process,
      vol. 25, no. 5, pp. 407–437, 2013.
M. O. Elish, A. Al-Yafei, and M. Al-Mulhem, “Empirical comparison of three metrics suites for
      fault prediction in packages of object-oriented systems: A case study of Eclipse,” Advances in
      Engineering Software, vol. 42, no. 10, pp. 852–859, 2011.
K. Erni, and C. Lewerentz, “Applying design-metrics to object-oriented frameworks,” In Proceedings
      of the 3rd International in Software Metrics Symposium, New York: IEEE, pp. 64–74, 1996.
L. H. Etzkorn, J. Bansiya, and C. Davis, “Design and code complexity metrics for OO classes,” Journal
      of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.
N. Fenton, and M. Neil, “A critique of software defect prediction models,” IEEE Transactions on
      Software Engineering, vol. 25, no. 3, pp. 1–15, 1999.
N. E. Fenton, and S. L. Pfleeger, Software Metrics—A Rigorous & Practical Approach, International
      Thomson Computer Press, 1996.
M. Fowler, Refactoring: Improving the Design of Existing Code, New Delhi, India: Pearson Education, 1999.
V. French, “Establishing software metric thresholds,” In Proceedings of the 9th International Workshop
      on Software Measurement, Quebec, Canada, 1999.
Y. Freund, and L. Mason, “The alternating decision tree algorithm,” In Proceedings of 16th International
       Conference on Machine Learning, Bled, Slovenia, pp. 124–133, 1999.
Y. Freund, and R. E. Schapire, “Experiments with a new boosting algorithm,” Proceedings of the 13th
       International Conference on Machine Learning, San Francisco, CA, vol. 96, pp. 148–156, 1996.
Y. Freund, and R. E. Schapire, “Large margin classification using the perceptron algorithm,” Machine
       Learning, vol. 37, no. 3, pp. 277–296, 1999.
J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boost-
       ing,” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
M. Friedman, “A comparison of alternative tests of significance for the problem of m rankings,” The
       Annals of Mathematical Statistics, vol. 11, no. 1, pp. 86–92, 1940.
H. Gall, K. Hajek, and M. Jazayeri, “Detection of logical coupling based on product release his-
       tory,” In Proceedings of the 14th International Conference on Software Maintenance, Bethesda, MD,
       pp. 190–198, 1998.
H. Gall, M. Jazayeri, and J. Krajewski, “CVS release history data for detecting logical couplings,” In
       Proceedings of IEEE 6th International Workshop on Software Evolution, Helsinki, Finland, pp. 13–23,
       2003.
H. Gall, M. Jazayeri, R. R. Klosch, and G. Trausmuth, “Software evolution observations based on
       product release history,” In Proceedings of the International Conference on Software Maintenance,
       pp. 160–166, Bari, Italy, 1997.
S. García, A. D. Benítez, F. Herrera, and A. Fernández, “Statistical comparisons by means of
       non-parametric tests: A case study on genetic based machine learning,” Algorithms, vol. 13,
       pp. 95–104, 2007.
D. Glassberg, K. El-Emam, W. Melo, and N. Madhavji, Validating Object-Oriented Design Metrics on
       a Commercial Java Application, Technical Report, NRC-ERB-1080, National Research Council of
       Canada, Canada, 2000.
I. Gondra, “Applying machine learning to software fault-proneness prediction,” Journal of Systems
       and Software, vol. 81, no. 2, pp. 186–195, 2008.
P. Goodman, Practical Implementation of Software Metrics, London: McGraw-Hill, 1993.
C. Grosan, and A. Abraham, “Hybrid evolutionary algorithms: Methodologies, architectures and
       reviews, studies in computational intelligence,” In Hybrid Evolutionary Algorithms, Berlin,
       Germany: Springer, pp. 1–17, 2007.
T. Gyimothy, R. Ferenc, and I. Siket, “Empirical validation of object-oriented metrics on open
       source software for fault prediction,” IEEE Transactions on Software Engineering, vol. 31, no. 10,
       pp. 897–910, 2005.
J. Hair, R. Anderson, R. Tatham, and W. Black, Multivariate Data Analysis, Upper Saddle River, NJ:
       Pearson, 2006.
M. Hall, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE
       Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 1–16, 2003.
M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning,”
       In Proceedings of the 7th International Conference on Machine Learning, pp. 359–366, 2000.
A. R. Han, S. Jeon, D. H. Bae, and J. Hong, “Measuring behavioural dependency for improving
       change-proneness prediction in UML-based design models,” Journal of Systems and Software,
       vol. 83, no. 2, pp. 222–234, 2010.
J. Han, and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, CA: Morgan Kaufmann,
       2001.
J. A. Hanley, and B. J. McNeil, “The meaning and use of the area under a receiver operating charac-
       teristic ROC curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
M. Harman, E. K. Burke, J. A. Clark, and Xin Yao, “Dynamic adaptive search based software engi-
       neering,” In Proceedings of IEEE International Symposium on Empirical Software Engineering and
       Measurement, Lund, Sweden, pp. 1–8, 2012a.
M. Harman, and J. A. Clark, “Metrics are fitness functions too,” In Proceedings of 10th IEEE International
       Symposium on Software Metrics, Chicago, IL, 2004.
M. Harman, Y. Jia, and Y. Zhang, “App store mining and analysis: MSR for app stores,” In Proceedings
      of 9th IEEE Working Conference on Mining Software Repositories, Zurich, Switzerland, pp. 108–111,
      June 2012b.
M. Harman, and B. F. Jones, “Search-based software engineering,” Information and Software
      Technology, vol. 43, no. 14, pp. 833–839, 2001.
M. Harman, S. A. Mansouri, and Y. Zhang, “Search-based software engineering: Trends, techniques
      and applications,” ACM Computing Survey, vol 45, no. 1, pp. 1–64, 2012c.
M. Harman, P. McMinn, J. Teixeira de Souza, and S. Yoo, “Search based software engineering:
      Techniques, taxonomy and tutorial,” In Empirical Software Engineering and Verification, Lecture
      Notes in Computer Science, Berlin, Germany: Springer-Verlag, pp. 1–59, 2012d.
D. L. Harnett, and J. L. Murphy, Introductory Statistical Analysis, Don Mills, Ontario, Canada: Addison-
      Wesley, 1980.
R. Harrison, S. Counsell, and R. Nithi, “Experimental assessment of the effect of inheritance on the
      maintainability of object oriented systems,” Journal of Systems and Software, vol. 52, pp. 173–179,
      1999.
A. Hassan, “The road ahead for mining software repositories,” In Frontiers of Software Maintenance,
      pp. 48–57, Beijing, People’s Republic of China, 2008.
E. Hassan, and R. C. Holt, “Predicting change propagation in software systems,” In Proceedings of the
      20th International Conference on Software Maintenance, Chicago, IL, pp. 284–293, 2004a.
E. Hassan, and R. C. Holt, “Using development history sticky notes to understand software archi-
      tecture,” In Proceedings of the 12th International Workshop on Program Comprehension, Bari, Italy,
      pp. 183–192, 2004b.
O. Hauge, C. Ayala, and R. Conradi, “Adoption of open source software in software-intensive
      organizations—a systematic literature review,” Information and Software Technology, vol.  52,
      no. 11, pp. 1133–1154, 2010.
S. Haykin, and R. Lippmann, “Neural networks, A comprehensive foundation,” International Journal
      of Neural Systems, vol. 5, no. 4, pp. 363–364, 1994.
S. Haykin, Neural Network: A Comprehensive Foundation, Prentice Hall, New Delhi, India, vol. 2, 1998.
H. He, and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data
      Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, “An investigation on the feasibility of cross-project defect
      prediction,” Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
B. Henderson-Sellers, Object-Oriented Metrics: Measures of Complexity, Prentice Hall, NJ, 1996.
B. Henderson-Sellers, “Some metrics for object-oriented software engineering,” In Proceedings of the
      1st IEEE International Conference on New Technology and Mobile Security, Beirut, Lebanon, 2007.
S. Herbold, “Training data selection for cross-project defect prediction,” In Proceedings of the 9th
      International Conference on Predictive Models in Software Engineering, Baltimore, MD, 2013.
M. Hitz, and B. Montazeri, “Measuring coupling and cohesion in object-oriented systems,” In Proceed-
      ings of the International Symposium on Applied Corporate Computing, Monterrey, Mexico, 1995.
R. P. Hooda, Statistics for Business and Economics, New Delhi, India: Macmillan, 2003.
W. G. Hopkins, “A new view of statistics,” Sport Science, 2003. http://www.sportsci.org/resource/
      stats/.
D. W. Hosmer, and S. Lemeshow, Applied Logistic Regression, New York: Wiley, 1989.
S. K. Huang, and K. M. Liu, “Mining version histories to verify the learning process of legitimate
      peripheral participants,” In Proceedings of the 2nd International Workshop on Mining Software
      Repositories, St. Louis, MO, pp. 84–78, 2005.
IEEE, IEEE Guide to Software Configuration Management, IEEE/ANSI Standard 1042–1987, IEEE, 1987.
IEEE, IEEE Standard Classification for Software Anomalies, IEEE Standard 1044–1993, IEEE, 1994.
M. Jorgenson, and M. Shepperd, “A systematic review of software development cost estimation
      studies,” IEEE Transactions on Software Engineering, vol. 33, no. 1, 33–53, 2007.
H. H. Kagdi, I. Maletic, and B. Sharif, “Mining software repositories for traceability links,” In
      Proceedings of 15th IEEE International Conference on Program Comprehension, pp. 145–154, 2007.
V. B. Livshits, and T. Zimmermann, “DynaMine: Finding common error patterns by mining soft-
      ware revision histories,” In Proceedings of the 10th European Software Engineering Conference held
      jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering,
      Lisbon, Portugal, pp. 296–305, 2005.
M. Lorenz, and J. Kidd, “Object-oriented software metrics,” Prentice Hall, NJ, 1994.
H. Lu, Y. Zhou, B. Xu, H. Leung, and L. Chen, “The ability of object-oriented metrics to predict
      change-proneness: A meta-analysis,” Empirical Software Engineering Journal, vol.  17, no.  3,
      pp. 200–242, 2012.
Y. Ma, G. Luo, X. Zeng, and A. Chen, “Transfer learning for cross-company software defect predic-
      tion,” Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
R. Malhotra, “Empirical validation of object-oriented metrics for predicting quality attributes,” PhD
      Dissertation, New Delhi, India: Guru Gobind Singh Indraprastha University, 2009.
R. Malhotra, “A systematic review of machine learning techniques for software fault predic-
      tion,” Applied Software Computing, vol. 27, pp. 504–518, 2015.
R. Malhotra, and A. J. Bansal, “Investigation on feasibility of machine learning algorithms for
      predicting software change using open source software,” International Journal of Reliability,
      Quality and Safety Engineering, World Scientific, Singapore, 2014a.
R. Malhotra, and A. Jain, “Fault Prediction Using statistical and machine learning methods for
      improving software quality,” Journal of Information Processing System, vol. 8, pp. 241–262, 2012.
R. Malhotra, and A. Jain, “Software Effort Prediction using Statistical and Machine Learning
      Methods,” International Journal of Advanced Computer Science and Applications, vol. 2, no. 1, pp.
      145–152, 2011.
R. Malhotra, and M. Khanna, “The ability of search-based algorithms to predict change-prone
      classes,” Software Quality Professional, vol. 17, no. 1, pp. 17–31, 2014b.
R. Malhotra, and M. Khanna, “Investigation of relationship between object-oriented metrics and change
      proneness,” International Journal of Machine Learning and Cybernetics, vol. 4, no. 4, pp. 273–286, 2013.
R. Malhotra, and M. Khanna, “Software engineering predictive modeling using search-based tech-
      niques: Systematic review and future directions,” In 1st North American Search-Based Software
      Engineering Symposium, Dearborn, MI, 2015.
R. Malhotra, and Y. Singh, “A defect prediction model for open source software,” In Proceedings of the
      World Congress on Engineering, London, Vol II, 2012.
R. Malhotra, and Y. Singh, “On the applicability of machine learning techniques for object-oriented
      software fault prediction,” Software Engineering: An International Journal, vol. 1, pp. 24–37, 2011.
R. Malhotra, Y. Singh, and A. Kaur, “Empirical validation of object-oriented metrics for predict-
      ing fault proneness at different severity levels using support vector machines,” International
      Journal of System Assurance Engineering Management, vol. 1, pp. 269–281, 2010.
R. Malhotra, “Search based techniques for software fault prediction: Current trends and future
      directions,” In Proceedings of 7th International Workshop on Search-Based Software Testing,
      Hyderabad, India, pp. 35–36, 2014a.
R. Malhotra, “Comparative Analysis of statistical and machine learning methods for predicting
      faulty modules,” Applied Soft Computing, vol. 21, pp. 286–297, 2014b.
R. Malhotra and A. Bansal, “Fault prediction considering threshold effects of object oriented met-
      rics,” Expert Systems, vol. 32, no. 2, pp. 203–219, 2015.
R. Malhotra and, M. Khanna, “A new metric for predicting software change using gene expres-
      sion programming”, In Proceedings of 5th International Workshop on Emerging Trends in Software
      Metrics, Hyderabad, India, pp. 8–14, 2014a.
R. Malhotra, N. Pritam, K. Nagpal, and P. Upmanyu, “Defect collection and reporting system for
      git based open source software,” In Proceedings of International Conference on Data Mining and
      Intelligent Computing, Delhi, India, pp. 1–7, 2014.
R. Malhotra and R. Raje, “An empirical comparison of machine learning techniques for software
      defect prediction,” In Proceedings of 8th International Conference on Bio-Inspired Information and
      Communication Technologies, Boston, MA, pp. 320–327, 2014.
A. Marcus, D. Poshyvankyk, and R. Ferenc, “Using the conceptual cohesion of classes for fault pre-
      diction in object-oriented systems,” IEEE Transactions on Software Engineering, vol.  34, no.  2,
      pp. 287–300, 2008.
F. Marini, R. Bucci, A. L. Magri, and A. D. Magri, “Artificial neural networks in chemometrics:
      History, examples and perspectives,” Microchemical Journal, vol. 88, no. 2, pp. 178–185, 2008.
R. C. Martin, Agile Software Development: Principles, Patterns, and Practices, Prentice Hall, NJ, 2002.
G. Mausa, T. G. Grbac, and B. D. Basic, “Multivariate logistic regression prediction of fault prone-
      ness in software modules,” In Proceedings of the IEEE 35th International Convention on MIPRO,
      Adriatic Coast, Croatia, pp. 698–703, 2012.
T. J. McCabe, “A complexity measure,” IEEE Transactions on Software Engineering, vol.  SE-2, no.  4,
      pp. 308–320, 1976.
Metrics Data Program, 2006, http://sarpresults.ivv.nasa.gov/ViewResearch/107.jsp.
T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shell, B. Turhan, and T. Zimmermann,
      “Local versus global lessons for defect prediction and effort estimation,” IEEE Transactions on
      Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.
T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predic-
      tors,” IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2–13, 2007.
T. Menzies, and A. Marcus, “Automated severity assessment of software fault reports,” IEEE
      International Conference on Software Maintenance, 2008.
T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, “Defect prediction from static code
      features: Current results, limitations, new approaches,” Automated Software Engineering, vol. 17,
      no. 4, pp. 375–407, 2010.
B. Meyer, H. Gall, M. Harman, and G. Succi, “Empirical answers to fundamental software engineer-
      ing problems (panel),” In ESEC/SIGSOFT FSE, Saint Petersburg, Russia, pp. 14–18, 2013.
L. S. Meyers, G. C. Gamst, and A. J. Guarino. Applied Multivariate Research: Design and Interpretation,
      Thousand Oaks, CA: Sage, 2013.
J. Michura, and M. A. M. Capretz, “Metrics suite for class complexity,” In Proceedings of the International
      Conference on Information Technology: Coding and Computing, Las Vegas, NV, pp. 404–409, 2005.
A. T. Misirh, A. B. Bener, and B. Turhan, “An industrial case study of classifier ensembles for locating
      software defects,” Software Quality Journal, vol. 19, no. 3, pp. 515–536, 2011.
A. Mitchell, and J. F. Power, “An empirical investigation into the dimensions of run-time coupling in
      Java programs,” In Proceedings of the 3rd Conference on the Principles and Practice of Programming
      in Java, Las Vegas, NV, pp. 9–14, 2004.
A. Mitchell, and J. F. Power, “Run-time cohesion metrics for the analysis of Java programs,” Technical
      Report, Series No.  NUIM-CS-TR-2003–08, Kildare, Ireland: National University of Ireland,
      Maynooth, Co., 2003.
W. N. H. W. Mohamed, M. N. M. Salleh, and A. H. Omar, “A comparative study of reduced error
      pruning method in decision tree algorithms,” In Proceedings of the IEEE International Conference
      on Control System, Computing and Engineering, IEEE, Penang, Malaysia, pp. 392–397, 2012.
R. Moser, W. Pedrycz, and G. Succi, “A Comparative analysis of the efficiency of change metrics and
      static code attributes for defect prediction,” In Proceedings of International Conference on Software
      Engineering, Leipzig, Germany, pp. 181–190, 2008.
T. R. G. Nair, and R. Selvarani, “Defect proneness estimation and feedback approach for software
      design quality improvement,” Information and Software Technology, vol. 54, no. 3, pp. 274–285,
      2012.
J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” In Proceedings of the International Conference
      on Software Engineering, San Francisco, CA, pp. 382–391, 2013.
NASA, Metrics data Repository, 2004, www.mdp.ivv.nasa.gov.
P. Naur and B. Randell (eds.), Software Engineering: Report of a Conference Sponsored by the NATO
      Science Committee, Garmisch, Germany. Brussels, Belgium: Scientific Affairs Division, NATO,
      1969.
A. A. Neto, and T. Conte, “A conceptual model to address threats to validity in controlled experi-
       ments,” In Proceedings of the 17th International Conference on Evaluation and Assessment in Software
       Engineering, Porto de Galinhas, Brazil, pp. 82–85, 2013.
M. Ohira, N. Ohsugi, T. Ohoka, and K. I. Matsumoto, “Accelerating cross-project knowledge collabo-
       ration using collaborative filtering and social networks,” In Proceedings of the 2nd International
       Workshop on Mining Software Repositories, New York, pp. 111–115, 2005.
A. Okutan, and O. T. Yildiz, “Software defect prediction using bayesian networks,” Empirical
       Software Engineering, vol. 19, no. 1, pp. 154–181, 2014.
H. Olague, L. Etzkorn, S. Gholston, and S. Quattlebaum, “Empirical validation of three software
       metric suites to predict the fault-proneness of object-oriented classes developed using highly
       iterative or agile software development processes,” IEEE Transactions on Software Engineering,
       vol. 33, no. 10, pp. 402–419, 2007.
H. M. Olague, L. H. Etzkorn, S. L. Messimer, and H. S. Delugach, “An empirical validation of object-
       oriented class complexity metrics and their ability to predict error-prone classes in highly iter-
       ative, or agile, software: A case study,” Journal of Software Maintenance and Evolution: Research
       and Practice, vol. 20, no. 3, pp. 171–197, 2008.
A. D. Oral, and A. B. Bener, “Defect prediction for embedded software,” In Proceedings of the IEEE
       22nd International Symposium on Computer and Information Sciences, Ankara, Turkey, pp. 1–6, 2007.
G. J. Pai, and J. Bechta Dugan, “Empirical analysis of software fault content and fault proneness
       using Bayesian methods,” IEEE Transactions on Software Engineering, vol. 33, no. 10, pp. 675–686,
       2007.
L. Pelayo, and S. Dick, “Evaluating stratification alternatives to improve software defect prediction,”
       IEEE Transactions on Reliability, vol. 61, no. 2, pp. 516–525, 2012.
F. Peters, T. Menzies, and A. Marcus, “Better cross company defect prediction,” In Proceedings of the
       10th IEEE Working Conference on Mining Software Repositories, San Francisco, CA, pp. 409–418,
       2013.
S. L. Pfleeger, “Experimental design and analysis in software engineering,” Annals of Software
       Engineering, vol. 1, no. 1, pp. 219–253, 1995.
M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.
A. Porter, and R. Selby, “Empirically guided software development using metric-based classification
       trees,” IEEE Software, vol. 7, no. 2, pp. 46–54, 1990.
R. S. Pressman, Software Engineering: A Practitioner’s Approach, New York: Palgrave Macmillan, 2005.
PROMISE, 2007, http://promisedata.org/repository/.
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1993.
D. Radjenović, M. Heričko, R. Torkar, and A. Živkovič, “Software fault prediction metrics: A sys-
       tematic literature review,” Information and Software Technology, vol. 55, no. 8, pp. 1397–1418, 2013.
F. Rahman, D. Posnett, and P. Devanbu, “Recalling the imprecision of cross-project defect predic-
       tion,” In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of
       Software Engineering, Cary, NC, p. 61, 2012.
J. Ratzinger, M. Fischer, and H. Gall, “Improving evolvability through refactoring,” In Proceedings
       of the 2nd International Workshop on Mining Software Repositories, St. Louis, MO, pp. 69–73, 2005.
M. Riaz, E. Mendes, and E. Tempero, “A systematic review of software maintainability prediction
       and metrics,” In Proceedings of the 3rd International Symposium on Empirical Software Engineering
       and Measurement, Lake Buena Vista, FL, pp. 367–377, 2009.
P. C. Rigby, and A. E. Hassan, “What can OSS mailing lists tell us? A preliminary psychometric text
        analysis of the Apache developer mailing list,” In Proceedings of the 4th International Workshop on
        Mining Software Repositories, Minneapolis, MN, p. 23, 2007.
D. Rodriguez, R. Ruiz, J. C. Riquelme, and J. S. Aguilar-Ruiz, “Searching for rules to detect defective
       modules: A subgroup discovery approach,” Information Sciences, vol. 191, pp. 14–30, 2012.
F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in
        the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
M. Sassano, “An experimental comparison of the voted perceptron and support vector machines in
       Japanese analysis tasks,” In Proceedings of the 3rd International Conference on Natural Language
       Processing, Hyderabad, India, pp. 829–834, 2008.
H. J. Seltman, Experimental Design and Analysis, 2012, http://www.stat.cmu.edu/~hseltman/309/
        Book/Book.pdf.
R. Shatnawi, “A quantitative investigation of the acceptable risk levels of object-oriented metrics in
       open-source systems,” IEEE Transactions on Software Engineering, vol. 36, no. 2, 2010.
R. Shatnawi, “The validation and threshold values of object-oriented metrics,” Dissertation,
        Huntsville, AL: Department of Computer Science, University of Alabama in Huntsville, 2006.
R. Shatnawi, and W. Li, “The effectiveness of software metrics in identifying error-prone classes in
        post-release software evolution process,” Journal of Systems and Software, vol. 81, no. 11, pp. 1868–1882,
       2008.
R. Shatnawi, W. Li, J. Swain, and T. Newman, “Finding software metrics threshold values using
       ROC curves,” Journal of Software Maintenance and Evolution: Research and Practice, vol. 22, no. 1,
       pp. 1–16, 2010.
P. H. Sherrod, “Predictive Modeling Software,” 2003, http://www.dtreg.com.
J. Singer, and N. Vinson, “Why and how research ethics matters to you. Yes, you!,” Empirical Software
       Engineering, vol. 6, no. 4, pp. 287–290, 2001.
Y. Singh, Software Testing, New York: Cambridge University Press, 2011.
Y. Singh, A. Kaur, and R. Malhotra, “A comparative study of models for predicting fault proneness
       in object-oriented systems,” International Journal of Computer Applications in Technology, vol. 49,
       no. 1, pp. 22–41, 2014.
Y. Singh, A. Kaur, and R. Malhotra, “Empirical validation of object-oriented metrics for predicting
       fault proneness models,” Software Quality Journal, vol. 18, no. 1, pp. 3–35, 2010.
Y. Singh, A. Kaur, and R. Malhotra, “Software fault proneness prediction using support vector
        machines,” In Proceedings of the World Congress on Engineering, London, pp. 240–245, 2009b.
Y. Singh, and R. Malhotra, Object-Oriented Software Engineering, New Delhi, India: PHI Learning,
       2012.
J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning
       algorithms,” In Proceedings of Advances in Neural Information Processing Systems, Nevada,
       pp. 2951–2959, 2012.
J. Spolsky, Painless Bug Tracking, 2000, http://www.joelonsoftware.com/articles/fog0000000029
       .html.
G. E. Stark, R. C. Durst, and T. M. Pelnik, “An evaluation of software testing metrics for NASA’s mis-
       sion control center,” Software Quality Journal, vol. 1, no. 2, pp. 115–132, 1992.
K. J. Stol, M. A. Babar, B. Russo, and B. Fitzgerald, “The use of empirical methods in open source
       software research: Facts, trends and future directions,” In Proceedings of the ICSE Workshop on
       Emerging Trends in Free/Libre/Open Source Software Research and Development, Washington, DC:
       IEEE Computer Society, pp. 19–24, 2009.
M. Stone, “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal
       Statistical Society. Series B (Methodological), vol. 36, pp. 111–147, 1974.
R. Subramanyam, and M. S. Krishnan, “Empirical analysis of CK metrics for object-oriented design
       complexity: Implications for software defects,” IEEE Transactions on Software Engineering,
       vol. 29, no. 4, pp. 297–310, 2003.
M. H. Tang, M. H. Kao, and M. H. Chen, “An empirical study on object-oriented metrics,” In
       Proceedings of the 6th International Software Metrics Symposium, Boca Raton, FL, pp. 242–249, 1999.
D. Tegarden, S. Sheetz, and D. Monarchi, “A software complexity model of object-oriented systems,”
       Decision Support Systems, vol. 13, no. 3–4, pp. 241–262, 1995.
D. Tegarden, S. Sheetz, and D. Monarchi, “The effectiveness of traditional software metrics for
       object-oriented systems,” In Proceedings of the 25th Hawaii International Conference on Systems
       Sciences, Kauai, HI, pp. 359–368, 1992.
S. W. Thomas, A. E. Hassan, and D. Blostein, “Mining unstructured software repositories,” In
       Evolving Software Systems, Berlin, Germany: Springer-Verlag, 2014.
M. M. T. Thwin, and T. Quah, “Application of neural networks for software quality prediction using
       object-oriented metrics,” Journal of Systems and Software, vol. 76, no. 2, pp. 147–156, 2005.
A. Tosun, B. Turhan, and A. Bener, “Validation of network measures as indicators of defective mod-
       ules in software systems,” In Proceedings of the 5th International Conference on Predictor Models in
       Software Engineering, Vancouver, Canada, 2009.
B. Turhan, and A. Bener, “Analysis of naïve Bayes assumptions on software fault data: An empirical
       study,” Data and Knowledge Engineering, vol. 68, no. 2, pp. 278–290, 2009.
B. Turhan, G. Kocak, and A. Bener, “Software defect prediction using call graph based ranking
       (CGBR) framework,” In Proceedings of the 34th EUROMICRO Conference on Software Engineering
       and Advanced Applications, Parma, Italy, pp. 191–198, 2008.
B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and
       within-company data for defect prediction,” Empirical Software Engineering, vol.  14, no.  5,
       pp. 540–578, 2009.
K. Ulm, “A statistical method for assessing a threshold in epidemiological studies,” Statistics in
       Medicine, vol. 10, no. 3, pp. 341–349, 1991.
Y. Uzun, and G. Tezel, “Rule learning with machine learning algorithms and artificial neural net-
        works,” Journal of Selcuk University Natural and Applied Science, vol. 1, no. 2, p. 54, 2012.
N. G. Vinson, and J. Singer, “A practical guide to ethical research involving humans,” In Guide to
       Advanced Empirical Software Engineering, London: Springer, pp. 229–256, 2008.
S. Wasserman, and K. Faust, Social Network Analysis: Methods and Applications, Cambridge: Cambridge University
       Press, 1994.
J. Wen, S. Li, Z. Lin, Y. Hu, and C. Huang, “Systematic literature review of machine learning based
       software effort estimation models,” Information and Software Technology, vol. 54, no. 1, pp. 41–59,
       2012.
E. J. Weyuker, “Evaluating software complexity measures,” IEEE Transactions on Software Engineering,
        vol. 14, no. 9, pp. 1357–1365, 1988.
D. R. White, “Cloud computing and SBSE,” In G. Ruhe and Y. Zhang (eds.), Search Based Software
        Engineering, Berlin, Germany: Springer, pp. 16–18, 2013.
F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
I. H. Witten, and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan
       Kaufmann, Burlington, MA, 2005.
C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in
       Software Engineering, Berlin, Germany: Springer-Verlag, 2012.
H. K. Wright, M. Kim, and D. E. Perry, “Validity concerns in software engineering research,” In FSE/
       SDP Workshop on Future of Software Engineering Research, Santa Fe, NM, pp. 411–414, 2010.
S. M. Yacoub, H. H. Ammar, and T. Robinson, “Dynamic metrics for object-oriented designs,”
       In Proceedings of the 5th International Software Metrics Symposium, Boca Raton, FL, pp. 50–61, 1999.
L. M. Yap, and B. Henderson-Sellers, “Consistency considerations of object-oriented class libraries,”
       Centre for Information Technology Research Report 93/3, University of New South Wales,
       1993.
R. K. Yin, Case Study Research: Design and Methods, New York: Sage Publications, 2002.
P. Yu, X. X. Ma, and J. Lu, “Predicting fault-proneness using OO metrics: An industrial case study,”
        In Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR), Budapest,
        Hungary, pp. 99–107, 2002.
Z. Yuming, and L. Hareton, “Empirical analysis of object-oriented design metrics for predicting
        high severity faults,” IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 771–784, 2006.
L. Zhao, and N. Takagi, “An application of support vector machines to Chinese character classifi-
        cation problem,” In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Montreal,
       Canada, pp. 3604–3608, 2007.
Y. Zhao, and Y. Zhang, “Comparison of decision tree methods for finding active objects,” Advances
       in Space Research, vol. 41, no. 12, pp. 1955–1959, 2008.
Y. Zhou, and H. Leung, “Empirical analysis of object-oriented design measures for predicting high
       and low severity faults,” IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 771–789,
       2006.
Y. Zhou, H. Leung, and B. Xu, “Examining the potentially confounding effect of class size on the
     associations between object-oriented metrics and change proneness,” IEEE Transactions on
     Software Engineering, vol. 35, no. 5, pp. 607–623, 2009.
Y. Zhou, B. Xu, and H. Leung, “On the ability of complexity metrics to predict fault-prone classes in
     object-oriented systems,” Journal of Systems and Software, vol. 83, no. 4, pp. 660–674, 2010.
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction:
      A large scale experiment on data vs. domain vs. process,” In Proceedings of the 7th Joint Meeting of
      the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations
      of Software Engineering, Amsterdam, the Netherlands, pp. 91–100, 2009.
T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller, “Mining version histories to guide software
     changes,” IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 429–445, 2005.
H. Zuse, Software Complexity: Measures and Methods, New York: Walter de Gruyter, 1991.