Software Quality Metrics
Software metrics can be classified into three categories: product metrics, process metrics, and
project metrics.
Product metrics describe the characteristics of the product such as size, complexity, design features,
performance, and quality level.
Process metrics can be used to improve software development and maintenance. Examples include
the effectiveness of defect removal during development, the pattern of testing defect arrival, and the
response time of the fix process.
Project metrics describe the project characteristics and execution. Examples include the number of
software developers, the staffing pattern over the life cycle of the software, cost, schedule, and
productivity.
Some metrics belong to multiple categories. For example, the in-process quality metrics of a project
are both process metrics and project metrics.
Product Quality Metrics
Software quality consists of two levels: intrinsic product quality and customer satisfaction. The
metrics discussed here cover both levels:
Mean time to failure
Defect density
Customer problems
Customer satisfaction.
Intrinsic product quality is usually measured by the number of "bugs" (functional defects) in the
software or by how long the software can run before encountering a "crash." In operational
definitions, the two metrics are defect density (rate) and mean time to failure (MTTF). The MTTF
metric is most often used with safety-critical systems such as air traffic control systems,
avionics, and weapons.
The defect density metric, in contrast, is used in many commercial software systems.
The two metrics are correlated but are different enough to merit close attention. First, one measures
the time between failures, the other measures the defects relative to the software size (lines of code,
function points, etc.). Second, although it is difficult to separate defects and failures in actual
measurements and data tracking, failures and defects (or faults) have different meanings. According
to the IEEE/American National Standards Institute (ANSI) standard (982.2):
An error is a human mistake that results in incorrect software.
The resulting fault is an accidental condition that causes a unit of the system to fail to function as
required.
A defect is an anomaly in a product.
A failure occurs when a functional unit of a software-related system can no longer perform its
required function or cannot perform it within specified limits.
The Defect Density Metric
Comparing the defect rates of software products involves many issues. To define a rate, we first have
to operationalize the numerator and the denominator, and specify the time frame. The general concept
of defect rate is the number of defects over the opportunities for error (OFE) during a specific time
frame.
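As a concrete illustration, the short Python sketch below computes a defect rate with lines of code as the opportunities for error; the defect count, the KLOC figure, and the six-month window are hypothetical values chosen only for the example.

    # Hypothetical illustration: defect rate with KLOC as the opportunities for error (OFE).
    defects_found = 120          # valid unique defects reported in the time frame
    kloc = 200.0                 # thousand lines of code shipped (the OFE proxy)
    time_frame_months = 6        # the specified time frame

    defects_per_kloc = defects_found / kloc
    print(f"Defect rate: {defects_per_kloc:.2f} defects per KLOC "
          f"over {time_frame_months} months")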
Lines of Code
The lines of code (LOC) metric is simple. The major problem comes from the ambiguity of the
operational definition, the actual counting. In the early days of Assembler programming, in which
one physical line was the same as one instruction, the LOC definition was clear. With the
availability of high-level languages, the one-to-one correspondence broke down. Differences
between physical lines and instruction statements (or logical lines of code) and differences among
languages contribute to the huge variations in counting LOCs. Even within the same language, the
methods and algorithms used by different counting tools can cause significant differences in the
final counts. Jones (1986) describes several variations:
Count only executable lines.
Count executable lines plus data definitions.
Count executable lines, data definitions, and comments.
Count executable lines, data definitions, comments, and job control language.
Count lines as physical lines on an input screen.
Count lines as terminated by logical delimiters.
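To make the ambiguity concrete, the minimal Python sketch below counts the same source text two ways, as physical lines and as non-blank, non-comment lines; the sample snippet is hypothetical, and real counting tools handle many more cases (block comments, continuation lines, data definitions versus executable statements, and so on).

    # Minimal sketch: two of the many possible LOC counts for the same source text.
    sample_source = """\
    # compute the area of a rectangle
    def area(width, height):
        # ignore invalid sizes
        if width <= 0 or height <= 0:
            return 0
        return width * height
    """

    lines = sample_source.splitlines()
    physical = len(lines)                              # every physical line
    logical = sum(1 for ln in lines
                  if ln.strip() and not ln.strip().startswith("#"))   # non-blank, non-comment
    print(f"physical lines: {physical}, non-comment lines: {logical}")

The two counts already differ for this tiny example, which is why defect rates or productivity figures based on LOC are comparable only when the counting method is stated.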
Function Points
A function can be defined as a collection of executable statements that performs a certain task,
together with declarations of the formal parameters and local variables manipulated by those
statements (Conte et al., 1986). The ultimate measure of software productivity is the number of
functions a development team can produce given a certain amount of resource, regardless of the
size of the software in lines of code.
The function point metric originated with Albrecht and his colleagues at IBM in the mid-1970s
(Albrecht, 1979). Although the technique does not measure functions explicitly, it does address some
of the problems associated with LOC counts in size and productivity measures, especially the
differences in LOC counts that result because different levels of languages are used. A function
point count is a weighted total of five major components that comprise an application:
Number of external inputs (e.g., transaction types)
Number of external outputs (e.g., report types)
Number of logical internal files (files as the user might conceive them, not physical files)
Number of external interface files (files accessed by the application but not maintained by it)
Number of external inquiries (types of online inquiries supported)
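A minimal Python sketch of the weighted total over the five components listed above is given below. The weights shown are the commonly published average-complexity weights, and the component counts are hypothetical; a full function point count also classifies each component by complexity and applies a value adjustment factor, both of which are omitted here.

    # Sketch of an unadjusted function count using average-complexity weights.
    # Component counts are hypothetical; complexity classification and the
    # value adjustment factor of a full count are omitted.
    AVERAGE_WEIGHTS = {
        "external_inputs": 4,
        "external_outputs": 5,
        "logical_internal_files": 10,
        "external_interface_files": 7,
        "external_inquiries": 4,
    }

    counts = {
        "external_inputs": 25,
        "external_outputs": 30,
        "logical_internal_files": 12,
        "external_interface_files": 4,
        "external_inquiries": 20,
    }

    unadjusted_fp = sum(AVERAGE_WEIGHTS[k] * counts[k] for k in counts)
    print(f"Unadjusted function count: {unadjusted_fp}")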
Customer Problems Metric
Another product quality metric used by major developers in the software industry measures the
problems customers encounter when using the product. For the defect rate metric, the numerator is
the number of valid defects. However, from the customers' standpoint, all problems they encounter
while using the software product, not just the valid defects, are problems with the software.
Problems that are not valid defects may be usability problems, unclear documentation or
information, duplicates of valid defects (defects that were reported by other customers and fixes were
available but the current customers did not know of them), or even user errors. These so-called non-
defect-oriented problems, together with the defect problems, constitute the total problem space of
the software from the customers' perspective. The problems metric is usually expressed in terms of
problems per user month (PUM).
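Problems per user month is commonly computed as the total problems customers reported in a period divided by the total number of license-months of the software in that period. The sketch below uses hypothetical numbers only to show the arithmetic.

    # Sketch: problems per user month (PUM) with hypothetical figures.
    problems_reported = 150          # all customer-reported problems in the period
    installed_licenses = 5000        # licenses in use during the period
    months_in_period = 3

    license_months = installed_licenses * months_in_period
    pum = problems_reported / license_months
    print(f"PUM = {pum:.4f} problems per user month")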
Figure: Scopes of Three Quality Metrics
Customer Satisfaction Metrics
Customer satisfaction is often measured by customer survey data via the five-point scale:
Very satisfied
Satisfied
Neutral
Dissatisfied
Very dissatisfied.
Satisfaction with the overall quality of the product and its specific dimensions is usually obtained
through various methods of customer surveys. For example, the specific parameters of customer
satisfaction in software monitored by IBM include the CUPRIMDSO categories (capability,
functionality, usability, performance, reliability, maintainability, documentation/information,
service, and overall); for Hewlett-Packard they are FURPS (functionality, usability, reliability,
performance, and service).
Based on the five-point-scale data, several metrics with slight variations can be constructed and used,
depending on the purpose of analysis. For example:
(1) Percent of completely satisfied customers
(2) Percent of satisfied customers (satisfied and completely satisfied)
(3) Percent of dissatisfied customers (dissatisfied and completely dissatisfied)
(4) Percent of non-satisfied (neutral, dissatisfied, and completely dissatisfied)
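Given raw survey counts on the five-point scale, these percentages are simple proportions. The Python sketch below computes metrics (2) and (4) from a hypothetical set of responses.

    # Sketch: satisfaction-category percentages from hypothetical five-point survey counts.
    responses = {
        "completely satisfied": 120,
        "satisfied": 180,
        "neutral": 60,
        "dissatisfied": 30,
        "completely dissatisfied": 10,
    }
    total = sum(responses.values())

    pct_satisfied = 100.0 * (responses["completely satisfied"] + responses["satisfied"]) / total
    pct_non_satisfied = 100.0 * (responses["neutral"] + responses["dissatisfied"]
                                 + responses["completely dissatisfied"]) / total
    print(f"percent satisfied (metric 2):     {pct_satisfied:.1f}%")
    print(f"percent non-satisfied (metric 4): {pct_non_satisfied:.1f}%")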
In addition to forming percentages for various satisfaction or dissatisfaction categories, the weighted
index approach can be used. For instance, some companies use the net satisfaction index (NSI)
to facilitate comparisons across products. The NSI has the following weighting factors:
Completely satisfied = 100%
Satisfied = 75%
Neutral = 50%
Dissatisfied = 25%
Completely dissatisfied = 0%
NSI ranges from 0% (all customers are completely dissatisfied) to 100% (all customers are
completely satisfied). If all customers are satisfied (but not completely satisfied), NSI will have a
value of 75%. This weighting approach, however, may mask the satisfaction profile of one's
customer set. For example, if half of the customers are completely satisfied and half are neutral,
NSI's value is also 75%, which is equivalent to the scenario that all customers are satisfied. If
satisfaction is a good indicator of product loyalty, then half completely satisfied and half neutral is
certainly less positive than all satisfied.
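The NSI computation and the masking effect just described can be verified with a short sketch; the two hypothetical response profiles below both yield an NSI of 75%, even though their satisfaction profiles differ considerably.

    # Sketch: net satisfaction index (NSI) using the weighting factors listed above.
    NSI_WEIGHTS = {
        "completely satisfied": 1.00,
        "satisfied": 0.75,
        "neutral": 0.50,
        "dissatisfied": 0.25,
        "completely dissatisfied": 0.00,
    }

    def nsi(responses):
        total = sum(responses.values())
        return 100.0 * sum(NSI_WEIGHTS[k] * n for k, n in responses.items()) / total

    all_satisfied = {"satisfied": 100}
    half_and_half = {"completely satisfied": 50, "neutral": 50}
    print(nsi(all_satisfied))    # 75.0
    print(nsi(half_and_half))    # 75.0 -- same NSI, very different satisfaction profile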
In-Process Quality Metrics
Because our goal is to understand the programming process and to learn to engineer quality into
the process, in-process quality metrics play an important role. In-process quality metrics are less
formally defined than end-product metrics, and their practices vary greatly among software
developers. For some organizations, in-process quality metrics simply means tracking defect
arrivals during formal machine testing. For others, with well-established software metrics
programs, they cover various parameters in each phase of the development cycle.
Defect Density During Machine Testing
Defect rate during formal machine testing is usually positively correlated with the defect rate in the
field. Higher defect rates found during testing are an indicator that the software has experienced higher
error injection during its development process, unless the higher testing defect rate is due to an
extraordinary testing effort, for example, additional testing or a new testing approach that was deemed
more effective in detecting defects. The rationale for the positive correlation is simple: Software
defect density never follows the uniform distribution. If a piece of code or a product has higher
testing defects, it is a result of more effective testing or it is because of higher latent defects in the
code. Myers (1979) discusses a counterintuitive principle: the more defects found during
testing, the more defects will be found later. That principle is another expression of the positive
correlation between defect rates during testing and in the field, or between defect rates in successive
phases of testing.
This simple metric of defects per KLOC or function point, therefore, is a good indicator of quality
while the software is still being tested. It is especially useful to monitor subsequent releases of a
product in the same development organization, because release-to-release comparisons are not
contaminated by extraneous factors. The development team or the project manager can use the
following scenarios to judge the release quality:
If the defect rate during testing is the same as or lower than that of the previous release, then ask: Did
the testing for the current release deteriorate?
If the answer is no, the quality perspective is positive.
If the answer is yes, you need to do extra testing.
If the defect rate during testing is substantially higher than that of the previous release, then ask: Did
we plan for and actually improve testing effectiveness?
If the answer is no, the quality perspective is negative. Ironically, the only remedial
approach that can be taken at this stage of the life cycle is to do more testing, which will
yield even higher defect rates.
If the answer is yes, then the quality perspective is the same or positive.
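The two scenarios above can be expressed as a simple decision sketch. The Python function below is only an illustration of the reasoning; the test defect rates and the two yes/no assessments are hypothetical inputs supplied by the team.

    # Sketch of the release-quality reasoning above (illustrative only).
    def judge_release_quality(current_rate, previous_rate,
                              testing_deteriorated, testing_improved):
        if current_rate <= previous_rate:
            # same or lower test defect rate than the previous release
            return ("do extra testing" if testing_deteriorated
                    else "positive quality perspective")
        # substantially higher test defect rate than the previous release
        return ("quality perspective is the same or positive" if testing_improved
                else "negative quality perspective; more testing is the only remedy left")

    print(judge_release_quality(current_rate=1.8, previous_rate=2.0,
                                testing_deteriorated=False, testing_improved=False))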
Defect Arrival Pattern During Machine Testing
Overall defect density during testing is a summary indicator. The pattern of defect arrivals gives
more information. Even with the same overall defect rate during testing, different patterns of defect
arrivals indicate different quality levels in the field. The figure below shows two contrasting patterns for both
the defect arrival rate and the cumulative defect rate. Data were plotted from 44 weeks before code-
freeze until the week prior to code-freeze. The second pattern, represented by the charts on the right
side, obviously indicates that testing started late, the test suite was not sufficient, and that the
testing ended prematurely.
Figure: Two Contrasting Defect Arrival Patterns During Testing
The objective is always to look for defect arrivals that stabilize at a very low level, or times between
failures that are far apart, before ending the testing effort and releasing the software to the field.
Such declining patterns of defect arrival during testing are indeed the basic assumption of many
software reliability models. The time unit for observing the arrival pattern is usually weeks and
occasionally months. For reliability models that require execution time data, the time interval is in
units of CPU time.
When we talk about the defect arrival pattern during testing, there are actually three slightly different
metrics, which should be looked at simultaneously:
The defect arrivals (defects reported) during the testing phase by time interval (e.g., week).
These are the raw number of arrivals, not all of which are valid defects.
The pattern of valid defect arrivals when problem determination is done on the reported problems.
This is the true defect pattern.
The pattern of defect backlog over time. This metric is needed because development
organizations cannot investigate and fix all reported problems immediately. This metric is a
workload statement as well as a quality statement. If the defect backlog is large at the end of
the development cycle and a lot of fixes have yet to be integrated into the system, the stability
of the system will be affected. Retesting is needed to ensure that targeted product quality levels
are reached.
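The third metric above can be derived from the first two once closures are tracked: the backlog at the end of each week is the previous backlog plus new arrivals minus problems closed. The sketch below shows that bookkeeping with hypothetical weekly data.

    # Sketch: defect backlog over time from weekly arrivals and closures (hypothetical data).
    weekly_arrivals = [40, 55, 60, 48, 35, 20]
    weekly_closed   = [30, 45, 50, 55, 40, 30]

    backlog = 0
    for week, (arrived, closed) in enumerate(zip(weekly_arrivals, weekly_closed), start=1):
        backlog += arrived - closed
        print(f"week {week}: arrivals={arrived}, closed={closed}, backlog={backlog}")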
Phase-Based Defect Removal Pattern
The phase-based defect removal pattern is an extension of the test defect density metric. In addition
to testing, it requires the tracking of defects at all phases of the development cycle, including the
design reviews, code inspections, and formal verifications before testing. Because a large
percentage of programming defects is related to design problems, conducting formal reviews or
functional verifications to enhance the defect removal capability of the process at the front end
reduces error injection. The pattern of phase-based defect removal reflects the overall defect
removal ability of the development process.
The figure below shows the patterns of defect removal of two development projects: project A was front-end
loaded and project B was heavily testing-dependent for removing defects. In the figure, the various
phases of defect removal are high-level design review (I0), low-level design review (I1), code
inspection (I2), unit test (UT), component test (CT), and system test (ST). As expected, the field
quality of project A outperformed project B significantly.
Figure: Defect Removal by Phase for Two Products
Defect Removal Effectiveness
Defect removal effectiveness (or efficiency, as used by some writers) can be defined as follows: the
number of defects removed during a development phase divided by the number of defects latent in the
product at that phase, expressed as a percentage. Because the total number of latent defects in the
product at any given phase is not known, the denominator of the metric can only be approximated. It
is usually estimated by the defects removed during the phase plus the defects found later.
The metric can be calculated for the entire development process, for the front end, and for each
phase. It is called early defect removal and phase effectiveness when used for the front end and for
specific phases, respectively. The higher the value of the metric, the more effective the development
process and the fewer defects escape to the next phase or to the field. This metric is a key concept
of the defect removal model for software development. The figure below shows the DRE by phase for a real
software project. The weakest phases were unit test (UT), code inspections (I2), and component test
(CT). Based on this metric, action plans to improve the effectiveness of these phases were
established and deployed.
Figure: Phase Effectiveness of a Software Project
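As an arithmetic sketch of the definition above, the Python code below computes phase effectiveness from hypothetical defect counts. It follows the simple approximation described earlier (defects removed in the phase plus defects found later), and it deliberately ignores defects injected in later phases.

    # Sketch: phase defect removal effectiveness (hypothetical defect counts).
    # effectiveness = removed in phase / (removed in phase + found later) * 100
    phases = ["I0", "I1", "I2", "UT", "CT", "ST"]
    removed_by_phase = {"I0": 120, "I1": 180, "I2": 250, "UT": 200, "CT": 150, "ST": 80}
    found_in_field = 30   # defects that escaped all in-process phases

    def phase_effectiveness(phase):
        idx = phases.index(phase)
        removed_here = removed_by_phase[phase]
        found_later = sum(removed_by_phase[p] for p in phases[idx + 1:]) + found_in_field
        return 100.0 * removed_here / (removed_here + found_later)

    for p in phases:
        print(f"{p}: {phase_effectiveness(p):.1f}%")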
Metrics for Software Maintenance
When development of a software product is complete and it is released to the market, it enters the
maintenance phase of its life cycle. During this phase the defect arrivals by time interval and customer
problem calls by time interval are the de facto metrics. However, the number of defect or problem
arrivals is largely determined by the development process before the maintenance phase. Not much
can be done to alter the quality of the product during this phase.
Therefore, these two de facto metrics, although important, do not reflect the quality of software
maintenance. What can be done during the maintenance phase is to fix the defects as soon as possible
and with excellent fix quality. Such actions, although still not able to improve the defect rate of the
product, can improve customer satisfaction to a large extent.
Fix Backlog and Backlog Management Index
Fix backlog is a workload statement for software maintenance. It is related to both the rate of defect
arrivals and the rate at which fixes for reported problems become available. It is a simple count of
reported problems that remain at the end of each month or each week. Using it in the format of a
trend chart, this metric can provide meaningful information for managing the maintenance process.
Another metric to manage the backlog of open, unresolved problems is the backlog management
index (BMI): the number of problems closed (solved) during the month divided by the number of
problem arrivals during the month, expressed as a percentage. If BMI is larger than 100, the backlog
is reduced; if BMI is less than 100, the backlog increased.
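A minimal sketch of the monthly BMI calculation, using hypothetical arrival and closure counts:

    # Sketch: backlog management index (BMI) per month, hypothetical data.
    monthly_arrivals = [200, 180, 160, 170, 150]
    monthly_closed   = [180, 190, 170, 160, 165]

    for month, (arrived, closed) in enumerate(zip(monthly_arrivals, monthly_closed), start=1):
        bmi = 100.0 * closed / arrived
        trend = "backlog reduced" if bmi > 100 else "backlog increased or unchanged"
        print(f"month {month}: BMI = {bmi:.1f}% ({trend})")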
With enough data points, the techniques of control charting can be used to calculate the backlog
management capability of the maintenance process. More investigation and analysis should be
triggered when the value of BMI exceeds the control limits. Of course, the goal is always to strive
for a BMI larger than 100. A BMI trend chart or control chart should be examined together with
trend charts of defect arrivals, defects fixed (closed), and the number of problems in the backlog.
The figure below is a trend chart by month of the numbers of opened and closed problems of a software
product, and a pseudo-control chart for the BMI. The latest release of the product became available to
customers in the month of the first data points on the two charts. This explains the rise and fall of
the problem arrivals and closures. The mean BMI was 102.9%, indicating that the capability of the
fix process was functioning normally. All BMI values were within the upper (UCL) and lower (LCL)
control limits, so the backlog management process was in control.
Figure: Opened Problems, Closed Problems, and Backlog Management Index by Month
A variation of the problem backlog index is the ratio of number of opened problems to number of
problem arrivals during the month. If the index is 1, that means the team maintains a backlog the
same as the problem arrival rate. If the index is below 1, that means the team is fixing problems faster
than the problem arrival rate. If the index is higher than 1, that means the team is losing ground in
their problem-fixing capability relative to problem arrivals. Therefore, this variant index is also a
statement of fix responsiveness.
Fix Response Time and Fix Responsiveness
For many software development organizations, guidelines are established on the time limit within
which the fixes should be available for the reported defects. Usually, the criteria are set in
accordance with the severity of the problems. For the critical situations in which the customers'
businesses are at risk due to defects in the software product, software developers or the software
change teams work around the clock to fix the problems. For less severe defects for which
circumventions are available, the required fix response time is more relaxed. The fix response time
metric is usually calculated as follows for all problems as well as by severity level:
Fix response time = Mean time of all problems from open to closed
If there are data points with extreme values, the median should be used instead of the mean. Such cases
could occur for less severe problems for which customers may be satisfied with the circumvention
and did not demand a fix. Therefore, the problem may remain open for a long time in the tracking
report.
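Given open and close dates for a set of problems, the mean and median response times can be computed as below. The dates are hypothetical, and the median is reported alongside the mean because of the possible extreme values mentioned above.

    # Sketch: fix response time (open to closed) with hypothetical dates.
    from datetime import date
    from statistics import mean, median

    problems = [                                   # (opened, closed) pairs
        (date(2023, 1, 3), date(2023, 1, 10)),
        (date(2023, 1, 5), date(2023, 1, 8)),
        (date(2023, 1, 7), date(2023, 3, 20)),     # extreme value: circumvention accepted
        (date(2023, 1, 9), date(2023, 1, 15)),
    ]

    days_open = [(closed - opened).days for opened, closed in problems]
    print(f"mean fix response time:   {mean(days_open):.1f} days")
    print(f"median fix response time: {median(days_open):.1f} days")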
To this writer's knowledge, the systems software development organizations of Hewlett-Packard (HP)
in California and IBM Rochester both have fix responsiveness processes in place. In fact, IBM
Rochester's practice originated from a benchmarking exchange with HP some years ago. The metric
for IBM Rochester's fix responsiveness is operationalized as the percentage of delivered fixes
meeting committed dates to customers.
Percent Delinquent Fixes
The mean (or median) response time metric is a central tendency measure. A more sensitive metric
is the percentage of delinquent fixes. For each fix, if the turnaround time greatly exceeds the required
response time, it is classified as delinquent. The percent of delinquent fixes is then the number of
fixes that exceeded the response time criteria (by severity level) divided by the number of fixes
delivered in a specified time, expressed as a percentage.
This metric, however, is not a metric for real-time delinquent management because it is for closed
problems only. Problems that are still open must be factored into the calculation for a real-time
metric. Assuming the time unit is 1 week, we propose that the percent delinquent of problems in the
active backlog be used. Active backlog refers to all opened problems for the week, which is the sum
of the existing backlog at the beginning of the week and new problem arrivals during the week. In
other words, it contains the total number of problems to be processed for the week, that is, the total
workload. The number of delinquent problems is checked at the end of the week. The figure below
shows the real-time delinquency index diagrammatically.
Figure: Real-Time Delinquency Index
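Under the weekly time unit proposed above, the real-time delinquency index can be sketched as follows. The counts are hypothetical; the active backlog is the existing backlog at the start of the week plus the week's new arrivals.

    # Sketch: weekly real-time delinquency index (hypothetical counts).
    backlog_at_week_start = 80       # problems still open entering the week
    new_arrivals_this_week = 40      # problems reported during the week
    delinquent_at_week_end = 18      # open problems past their required fix date

    active_backlog = backlog_at_week_start + new_arrivals_this_week
    delinquency_index = 100.0 * delinquent_at_week_end / active_backlog
    print(f"Real-time delinquency index: {delinquency_index:.1f}%")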
It is important to note that the metric of percent delinquent fixes is a cohort metric. Its denominator
refers to a cohort of problems (problems closed in a given period of time, or problems to be
processed in a given week). The cohort concept is important because if it is operationalized as a
cross-sectional measure, then invalid metrics will result. For example, we have seen practices in
which at the end of each week the number of problems in backlog (problems still to be fixed) and the
number of delinquent open problems were counted, and the percent delinquent problems was
calculated. This cross-sectional counting approach neglects problems that were processed and
closed before the end of the week, and will create a high delinquent index when significant
improvement is made.
Fix Quality
Fix quality or the number of defective fixes is another important quality metric for the maintenance
phase. From the customer's perspective, it is bad enough to encounter functional defects when
running a business on the software. It is even worse if the fixes turn out to be defective. A fix is
defective if it did not fix the reported problem, or if it fixed the original problem but injected a new
defect. For mission-critical software, defective fixes are detrimental to customer satisfaction. The
metric of percent defective fixes is simply the percentage of all fixes in a time interval that are
defective.
A defective fix can be recorded in two ways: Record it in the month it was discovered or record it
in the month the fix was delivered. The first is a customer measure, the second is a process
measure. The difference between the two dates is the latent period of the defective fix. It is
meaningful to keep track of the latency data and other information such as the number of customers
who were affected by the defective fix. Usually the longer the latency, the more customers are
affected because there is more time for customers to apply that defective fix to their software
system.
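A brief sketch of the percent defective fixes calculation and the latency bookkeeping, using hypothetical fix records (each record carries the delivery date and, if the fix turned out to be defective, the date the defect was discovered):

    # Sketch: percent defective fixes and defective-fix latency (hypothetical records).
    from datetime import date

    fixes = [   # (delivered, date_found_defective_or_None)
        (date(2023, 4, 10), None),
        (date(2023, 4, 18), date(2023, 6, 2)),   # defective fix
        (date(2023, 5, 3),  None),
        (date(2023, 5, 21), date(2023, 5, 30)),  # defective fix
    ]

    defective = [(delivered, found) for delivered, found in fixes if found is not None]
    pct_defective = 100.0 * len(defective) / len(fixes)
    latencies = [(found - delivered).days for delivered, found in defective]
    print(f"percent defective fixes: {pct_defective:.1f}%")
    print(f"latency of defective fixes (days): {latencies}")

Counting by the delivery date gives the process view, while counting by the discovery date gives the customer view; the latency values link the two.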
There is an argument against using percentage for defective fixes. If the number of defects, and
therefore the fixes, is large, then the small value of the percentage metric will show an optimistic
picture, although the number of defective fixes could be quite large. This metric, therefore, should
be a straight count of the number of defective fixes. The quality goal for the maintenance process, of
course, is zero defective fixes without delinquency.