Conducting Equivalence Testing in Laboratory Applications: Standard Practice For
Copyright © ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959. United States
E2935 − 16
3.1.6 degrees of freedom, n—the number of independent data points minus the number of parameters that have to be estimated before calculating the variance. E2586
3.1.7 equivalence, n—condition that two population parameters differ by no more than predetermined limits.
3.1.8 intermediate precision conditions, n—conditions under which test results are obtained with the same test method using test units or test specimens taken at random from a single quantity of material that is as nearly homogeneous as possible, and with changing conditions such as operator, measuring equipment, location within the laboratory, and time. E177
3.1.9 mean, n—of a population, µ, average or expected value of a characteristic in a population; of a sample, X̄, sum of the observed values in the sample divided by the sample size. E2586
3.1.10 percentile, n—quantile of a sample or a population, for which the fraction less than or equal to the value is expressed as a percentage. E2586
3.1.11 population, n—the totality of items or units of material under consideration. E2586
3.1.12 population parameter, n—summary measure of the values of some characteristic of a population. E2586
3.1.13 precision, n—the closeness of agreement between independent test results obtained under stipulated conditions. E177
3.1.14 quantile, n—value such that a fraction f of the sample or population is less than or equal to that value. E2586
3.1.15 repeatability, n—precision under repeatability conditions. E177
3.1.16 repeatability conditions, n—conditions where independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment within short intervals of time. E177
3.1.17 repeatability standard deviation ($s_r$), n—the standard deviation of test results obtained under repeatability conditions. E177
3.1.18 sample, n—a group of observations or test results, taken from a larger collection of observations or test results, which serves to provide information that may be used as a basis for making a decision concerning the larger collection. E2586
3.1.19 sample size, n, n—number of observed values in the sample. E2586
3.1.20 sample statistic, n—summary measure of the observed values of a sample. E2586
3.1.21 standard deviation—of a population, σ, the square root of the average or expected value of the squared deviation of a variable from its mean; —of a sample, s, the square root of the sum of the squared deviations of the observed values in the sample from their mean divided by the sample size minus 1. E2586
3.1.22 test result, n—the value of a characteristic obtained by carrying out a specified test method. E2282
3.1.23 test unit, n—the total quantity of material (containing one or more test specimens) needed to obtain a test result as specified in the test method. See test result. E2282
3.1.24 variance, $\sigma^2$, $s^2$, n—square of the standard deviation of the population or sample. E2586
3.2 Definitions of Terms Specific to This Standard:
3.2.1 bias equivalence, n—equivalence of a population mean with an accepted reference value.
3.2.2 equivalence limit, E, n—in equivalence testing, a limit on the difference between two population parameters.
3.2.2.1 Discussion—In certain applications, this may be termed practical limit or practical difference.
3.2.3 equivalence test, n—a statistical test conducted within predetermined risks to confirm equivalence of two population parameters.
3.2.4 means equivalence, n—equivalence of two population means.
3.2.5 non-inferiority, n—condition that the difference in means or variances of test results between a modified testing process and a current testing process with respect to a performance characteristic is no greater than a predetermined limit in the direction of inferiority of the modified process to the current process.
3.2.5.1 Discussion—Other terms used for non-inferior are “equivalent or better” or “at least equivalent as.”
3.2.6 paired samples design, n—in means equivalence testing, single samples are taken from the two populations at a number of sampling points.
3.2.6.1 Discussion—This design is termed a randomized block design for a general number of populations sampled, and each group of data within a sampling point is termed a block.
3.2.7 power, n—in equivalence testing, the probability of accepting equivalence, given the true difference between two population means.
3.2.7.1 Discussion—In the case of testing for bias equivalence the power is the probability of accepting equivalence, given the true difference between a population mean and an accepted reference value.
3.2.8 two independent samples design, n—in means equivalence testing, replicate test results are determined independently from two populations at a single sampling time for each population.
3.2.8.1 Discussion—This design is termed a completely randomized design for a general number of populations sampled.
3.2.9 two one-sided tests (TOST) procedure, n—a statistical procedure used for testing the equivalence of the parameters from two distributions (see equivalence).
3.3 Symbols:
B = bias (7.1.1)
$d_j$ = difference between a pair of test results at sampling point j (7.1.1)
$\bar{d}$ = average difference (7.1.1)
D = difference in sample means (6.1.2) (X1.1.2)
E = equivalence limit (5.2)
$E_1$ = lower equivalence limit (5.2.1)
$E_2$ = upper equivalence limit (5.2.1)
f = degrees of freedom for s (8.1.1) (X1.1.2)
$F_{1-\alpha}$ = (1 – α)th percentile of the F distribution (9.3.1)
$f_i$ = degrees of freedom for $s_i$ (6.1.1)
$f_p$ = degrees of freedom for $s_p$ (6.1.2)
$\mathcal{F}(\bullet)$ = the cumulative F distribution function (X1.6.3)
$H_0$ = null hypothesis (X1.1.1)
$H_A$ = alternate hypothesis (X1.1.1)
n = sample size (number of test results) from a population (5.4) (6.1.3) (7.1.1) (8.1.1)
$n_i$ = sample size from ith population (6.1.1)
$n_1$ = sample size from population 1 (6.1.2)
$n_2$ = sample size from population 2 (6.1.2)
R = ratio of two sample variances (5.5.3)
$\mathcal{R}$ = ratio of two population variances (X1.6.3)
s = sample standard deviation (8.1.1)
$s_B$ = sample standard deviation for bias (8.1.2)
$s_d$ = standard deviation of the difference between two test results (7.1.1)
$s_D$ = sample standard deviation for mean difference (6.1.3) (X1.1.2)
$s_i$ = sample standard deviation for ith population (6.1.1)
$s_i^2$ = sample variance for ith population (6.1.1)
$s_1^2$ = sample variance for population 1 (6.1.2)
$s_1^2$ = variance of test results from the current process (5.5.3)
$s_2^2$ = sample variance for population 2 (6.1.2)
$s_2^2$ = variance of test results from the modified process (5.5.3)
$s_p$ = pooled sample standard deviation (6.1.2)
$s_r$ = repeatability sample standard deviation (6.2)
t = Student’s t statistic (6.1.4) (7.1.3) (8.1.3)
$t_{1-\alpha,f}$ = (1 – α)th percentile of the Student’s t distribution with f degrees of freedom (X1.1.2)
$X_{ij}$ = jth test result from the ith population (6.1)
$UCL_R$ = upper confidence limit for $\mathcal{R}$ (9.3.1)
$\bar{X}$ = test result average (8.1.1)
$\bar{X}_i$ = test result average for the ith population (6.1.1)
$\bar{X}_1$ = test result average for population 1 (6.1.3)
$\bar{X}_2$ = test result average for population 2 (6.1.3)
$Z_{1-\alpha}$ = (1 – α)th percentile of the standard normal distribution (X1.6.1)
α = consumer’s risk (5.2.3) (6.2) (7.2)
β = producer’s risk (5.4.1)
∆ = true mean difference between populations (5.4.1)
µ = population mean (X1.4.1)
$\mu_i$ = ith population mean (X1.1.1)
ν = approximate degrees of freedom for $s_D$ (X1.1.4)
σ = standard deviation of the test method (5.2)
$\sigma_d$ = standard deviation of the true difference between two populations (7.2)
Φ(•) = standard normal cumulative distribution function (X1.6.1)
3.4 Acronyms:
3.4.1 ARV, n—accepted reference value (5.3.3) (8.1) (X1.4)
3.4.2 CRM, n—certified reference material (5.3.3) (8.1)
3.4.3 ILS, n—interlaboratory study (6.2)
3.4.4 LCL, n—lower confidence limit (6.2.5) (7.2.3)
3.4.5 TOST, n—two one-sided tests (5.5.1) (Section 6) (Section 7) (Section 8) (Appendix X1)
3.4.6 UCL, n—upper confidence limit (6.2.5) (7.2.3)
4. Significance and Use
4.1 Laboratories conducting routine testing have a continuing need to make improvements in their testing processes. In these situations it must be demonstrated that any changes will not cause an undesirable shift in the test results from the current testing process nor substantially affect a performance characteristic of the test method. This standard provides guidance on experiments and statistical methods needed to demonstrate that the test results from a modified testing process are equivalent to those from the current testing process, where equivalence is defined as agreement within a prescribed limit, termed an equivalence limit.
4.1.1 Examples of modifications to the testing process include, but are not limited to, the following:
(1) Changes to operating levels in the steps of the test method procedure,
(2) Installation of new instruments, apparatus, or sources of reagents and test materials,
(3) Evaluation of new personnel performing the testing, and
(4) Transfer of testing to a new location.
4.1.2 The equivalence limit, which represents a worst-case difference, is determined prior to the equivalence test and its value is usually set by consensus among subject-matter experts.
4.2 Two principal types of equivalence are covered in the practice, means equivalence and non-inferiority. Means equivalence implies that a sustained shift in test results between the modified and current testing processes refers to an absolute difference, meaning differences in either direction from zero. Non-inferiority is concerned with a difference only in the direction of an inferior outcome in a performance characteristic of the modified testing procedure versus the current testing procedure.
4.2.1 Equivalence testing is performed by an experiment that generates test results from the modified and current testing procedures on the same materials that are routinely tested. An exception is bias equivalence where the experiment consists of conducting multiple testing on a certified reference material (CRM) having an accepted reference value (ARV) to evaluate the test method bias.
4.2.2 Examples of performance characteristics directly applicable to the test method are bias, precision, sensitivity, specificity, linearity, and range. Additional characteristics are test cost and elapsed time to conduct the test procedure.
4.2.3 Non-inferiority may involve trade-offs in performance characteristics between the modified and current procedures. For example, the modified process may be slightly inferior to the established process with respect to assay sensitivity or precision but may have off-setting advantages such as faster delivery of results or lower testing costs.
4.3 Risk Management—Guidance is also provided for determining the amount of data required to control the risks of
making the wrong decision in accepting or rejecting equivalence (see Section X1.2).
4.3.1 The consumer’s risk is the risk of falsely declaring equivalence. The probability associated with this risk is directly controlled to a low level so that accepting equivalence gives a high degree of assurance that the true difference is less than the equivalence limit.
4.3.2 The producer’s risk is the risk of falsely rejecting equivalence. The probability associated with this risk is controlled by the amount of data generated by the experiment. If valid improvements are rejected by equivalence testing, this can lead to opportunity losses to the company and its laboratories (the producers) or cause unnecessary additional effort in improving the testing process.
5. Planning and Executing the Equivalence Study
5.1 This section discusses the stages of conducting an equivalence test: (1) determining the information needed, (2) setting up and conducting the study design, and (3) performing the statistical analysis of the resulting data. The study is usually conducted either in a single laboratory or, in the case of a method transfer, in both the originating and receiving laboratories. Using multiple laboratories will almost always increase the inherent variability of the data in the study, which will increase the cost of performing the study due to the need for more data.
5.2 Prior information required for the study design includes the equivalence limit E, the consumer’s risk α, and an estimate of the test method precision σ.
5.2.1 For means equivalence tests there are two equivalence limits, –E and E, that are tested. Limits may be nonsymmetrical around zero, such as $-E_1$ and $E_2$, but this is not usual and would require advice from a qualified statistician for a proper design setup. For non-inferiority tests only one of these limits is tested.
5.2.2 A prior estimate of the test method precision is essential for determining the number of test results required in the study design for adequate producer’s risk control. This estimate can be available from method development work, from an interlaboratory study, or from other sources. The precision estimate should take into account the test conditions of the study, such as repeatability, intermediate, or reproducibility conditions.
5.2.3 The consumer’s risk may be determined by an industry norm or a regulatory requirement. A probability value often used is α = 0.05, which is a 5 % risk to the consumer that the study falsely declares equivalence.
5.3 The design type determines how the data are collected and how much data are needed to control the risk of a wrong decision. A sufficient quantity of a homogeneous material for the required number of tests is necessary. For comparing data from the modified and current testing processes, two basic designs are discussed in this practice, the Two Independent Samples Design, and the Paired Samples Design. These designs are suitable for determining either means equivalence or non-inferiority.
5.3.1 The Two Independent Samples Design for means equivalence is discussed in Section 6. In this design sets of independent test results are usually generated in a single laboratory by both testing procedures under repeatability conditions. For method transfer each laboratory generates independent test results using the same testing procedure, preferably under repeatability conditions. If this is not possible due to constraints on time or facilities, then the test results can be obtained under intermediate precision conditions, but a statistician is recommended for design and analysis of the test.
5.3.2 The Paired Samples Design for means equivalence is discussed in Section 7. In this design, multiple pairs of single test results from each testing procedure are generated under different conditions of a second variable, such as time of process sampling. This design is most useful when there are constraints on conducting the two independent samples design.
5.3.3 The design for bias equivalence is discussed in Section 8. In this design test results are generated by the current testing process on a certified reference material (CRM) having an accepted reference value (ARV) for the material characteristic of interest.
5.3.4 The statistical analysis for non-inferiority is discussed in Section 9 for evaluating two testing procedures with respect to a performance characteristic. The data can be generated by either of the designs discussed in Sections 6 and 7.
5.4 Sample size in the design context refers to the number n of test results required by each testing process to manage the producer’s risk. It is possible to use different sample sizes for the modified and current test processes, but this can lead to poor control of the consumer’s risk (see X1.1.4).
5.4.1 The number of test results, symbol n, from each testing process controls the producer’s risk β of falsely rejecting means equivalence at a given true mean difference, ∆. The producer’s risk may be alternatively stated in terms of the power, the probability 1–β of correctly accepting equivalence at a given value of ∆.
5.4.1.1 For symmetric equivalence limits in means equivalence tests the power profile plots the probability 1–β against the absolute value of ∆, due to the symmetry of the equivalence limits. This calculation can be performed using a spreadsheet computer package (see X1.6.1 and Appendix X2).
5.4.1.2 An example of a set of power profiles in means equivalence tests is shown in Fig. 1. The probability scale for power on the vertical axis varies from 0 to 1. The horizontal axis is the true absolute difference ∆. The power profile, a reversed S-shaped curve, should be close to a power probability of 1 at zero absolute difference and will decline to the consumer risk probability at an absolute difference of E. Power for absolute differences greater than E is less than the consumer risk and declines asymptotically to zero as the absolute difference increases.
5.4.1.3 In Fig. 1 power profiles are shown for three different sample sizes for testing means equivalence. Increasing the sample size moves the power curve to the right, giving a greater chance of accepting equivalence for a given true difference ∆. Equations for power profiles are shown in Section X1.5 and a spreadsheet example in Appendix X2.
5.4.2 Power curves for bias equivalence and non-inferiority are constructed by different formulas but have the same shape and interpretation as those for means equivalence.
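The profile shape described in 5.4.1.2 and 5.4.1.3 can be checked numerically. The following is a minimal sketch, not the equations of this practice: it assumes the common normal approximation to TOST power for the two independent samples design with n results per process, $1-\beta \approx \Phi((E-|\Delta|)/se - Z_{1-\alpha}) + \Phi((E+|\Delta|)/se - Z_{1-\alpha}) - 1$ with $se = \sigma\sqrt{2/n}$. The function name `tost_power` and the inputs E = 2, σ = 1 are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi and Z

def tost_power(delta, n, sigma, E, alpha=0.05):
    """Approximate power (1 - beta) of a means equivalence test.

    Assumption: normal approximation with known sigma and n results
    from each of two processes; not taken from the standard itself.
    """
    se = sigma * sqrt(2.0 / n)               # standard error of the mean difference
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)     # Z_{1-alpha}
    d = abs(delta)                           # profile is symmetric in delta
    power = (_STD_NORMAL.cdf((E - d) / se - z)
             + _STD_NORMAL.cdf((E + d) / se - z) - 1.0)
    return max(0.0, power)

if __name__ == "__main__":
    # Hypothetical inputs: E = 2, sigma = 1, three candidate sample sizes.
    for n in (5, 10, 20):
        profile = [round(tost_power(d, n, sigma=1.0, E=2.0), 3)
                   for d in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5)]
        print(n, profile)
```

Under this approximation each profile passes through the consumer's risk α at ∆ = E, drops below α beyond E, and lies higher for larger n at any ∆ inside the limits.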
5.4.2.1 For non-inferiority testing the power profile plots the probability 1–β against the true difference ∆ for means (see X1.6.2) or against the true variance ratio $\mathcal{R}$ for variances (see X1.6.3).
5.4.3 Power curves are evaluated by entering different values of n and evaluating the curve shape. A practical solution is to choose n such that the power is above a 0.9 probability out to about one-half to two-thirds of the distance to E, thus giving a high probability that equivalence will be demonstrated for a range of true absolute differences that are deemed of little or no scientific import in the test result.
5.5 The statistical analysis for accepting or rejecting equivalence is similar for all cases and depends on the outcome of one-sided statistical hypothesis tests for means and variances. The calculations are given in detail with examples in Sections 6 – 9. The statistical theory is given in an appendix (see Section X1.1).
5.5.1 The data analysis for means equivalence testing in this practice uses a statistical methodology termed the two one-sided tests (TOST) procedure. This is based on calculating confidence limits for the true mean difference as $D \pm t\,s_D$, where D is the difference between the two test result averages, $s_D$ is the standard error of that difference, and t is a tabulated multiplier based on the number of data and a preselected confidence level. The calculation for $s_D$ is based on the standard deviations of the two sets of data and the type of study design. Then equivalence is supported if both of the following two conditions are met:
(1) The lower confidence limit, $LCL = D - t\,s_D$, is greater than the lower equivalence limit, –E, and
(2) The upper confidence limit, $UCL = D + t\,s_D$, is less than the upper equivalence limit, E.
NOTE 1—Historically, this procedure originated in the pharmaceutical industry for use in bioequivalence trials (1, 2),⁴ denoted as the Two One-Sided Tests Procedure, which has since been adopted for use in testing and measurement applications (3, 4).
⁴ The boldface numbers in parentheses refer to a list of references at the end of this standard.
5.5.1.1 The conventional Student’s t test based on the null hypothesis of a zero difference is not recommended for means equivalence testing as it does not properly control the consumer’s and producer’s risks for this application (see Section X1.3). This test is suitable for supporting superiority of the modified process versus the established process instead of equivalence.
5.5.1.2 For bias equivalence the calculation for $s_D$ is based on only a single set of data because the ARV is considered as a known mean with zero variability for the purpose of the equivalence study.
5.5.2 The data analysis for non-inferiority testing of population means uses a single one-sided test in the direction of an inferior outcome with respect to a performance characteristic determined by the test results. When the performance characteristic is defined as “higher is better”, such as method sensitivity, the statistical test supports noninferiority when $LCL > -E$. Conversely, when the performance characteristic is defined as “lower is better”, such as incidence of misclassifications, the statistical test supports noninferiority when $UCL < E$. Note that the means equivalence procedure comprises two one-sided statistical tests while the non-inferiority procedure performs only a single one-sided statistical test. For statistical details see Section X1.5.
5.5.3 For the equivalence testing of precision the variance is used, and “lower is better” for this parameter, so the test for non-inferiority applies. Because variances are a scale parameter, the non-inferiority test is based on the ratio R of the two sample variances instead of their difference; thus $R = s_2^2 / s_1^2$, where $s_1^2$ and $s_2^2$ are the calculated variances of the test results from the current and modified test processes, respectively. An upper confidence limit for the true variance ratio $\sigma_2^2 / \sigma_1^2$, denoted $UCL_R$, for the given confidence level and sample sizes, can be found from the tabulated F distribution. The non-inferiority limit E is also in the form of a ratio. For example, if E = 2, the noninferiority limit would allow the modified process to have up to twice the variance of the established process or up to about 1.4 times the standard deviation in the worst case. The statistical test supports noninferiority if $UCL_R < E$.

TABLE 1 Data for Equivalence Test Between Two Laboratories
                Test Results
Laboratory 1    96.9   97.9   98.5   97.5   97.7   97.2
Laboratory 2    97.8   97.6   98.1   98.6   98.6   98.9
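As an illustration of the variance-ratio test described in 5.5.3, the sketch below applies it to the Table 1 data. Several things here are assumptions made only for illustration: Laboratory 1 is treated as the current process and Laboratory 2 as the modified process (Table 1 was collected for a means equivalence study, not a precision study), the upper limit is assumed to take the usual form $UCL_R = R \cdot F_{1-\alpha}$, and the limit E = 2 is borrowed from the example in 5.5.3. The function name is hypothetical; the multiplier 5.05 is the tabulated 95th percentile of the F distribution with (5, 5) degrees of freedom.

```python
from statistics import variance

# Tabulated 95th percentile of F with (5, 5) degrees of freedom.
F_95_5_5 = 5.05

def variance_ratio_noninferiority(current, modified, f_quantile, E):
    """Return (R, UCL_R, supported) for a "lower is better" variance test.

    Assumption: UCL_R = R * F_{1-alpha}; with equal degrees of freedom
    the order of the F degrees of freedom does not matter.
    """
    s1_sq = variance(current)    # sample variance, current process
    s2_sq = variance(modified)   # sample variance, modified process
    R = s2_sq / s1_sq            # ratio of sample variances
    ucl_r = R * f_quantile       # upper confidence limit for the true ratio
    return R, ucl_r, ucl_r < E

lab1 = [96.9, 97.9, 98.5, 97.5, 97.7, 97.2]  # Table 1, Laboratory 1
lab2 = [97.8, 97.6, 98.1, 98.6, 98.6, 98.9]  # Table 1, Laboratory 2

R, ucl_r, supported = variance_ratio_noninferiority(lab1, lab2, F_95_5_5, E=2.0)
print(f"R = {R:.3f}, UCL_R = {ucl_r:.2f}, non-inferiority supported: {supported}")
```

With only six results per laboratory the upper limit exceeds E = 2 even though the sample ratio is below 1, which suggests that variance comparisons generally need considerably more data than means comparisons.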
$$ s_D = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \tag{8} $$
If $n_1 = n_2 = n$, then:
$$ s_D = s_p \sqrt{\frac{2}{n}} $$
6.1.4 Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100(1–2α) % two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits (0 ± E), equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$ UCL = D + t\,s_D \tag{9} $$
$$ LCL = D - t\,s_D \tag{10} $$
where t is the upper 100(1–α)th percentile of the Student’s t distribution with ($n_1 + n_2 - 2$) degrees of freedom.
6.2 Example for Means Equivalence—The example shown is data from a transfer of an ASTM test method from R&D Lab 1 to Plant Lab 2 (Table 1). An equivalence limit of 2 units was proposed with a consumer risk of 5 %. An interlaboratory
$$ s_2^2 = \left[ (97.8 - 98.27)^2 + \dots + (98.9 - 98.27)^2 \right] / (6 - 1) = 0.26267 $$
$$ s_1 = \sqrt{0.31367} = 0.560 $$
$$ s_2 = \sqrt{0.26267} = 0.513 $$
$$ f_i = n_i - 1 = 6 - 1 = 5 $$
The estimates of standard deviation are in good agreement with the ILS estimate of 0.5 mg/g.
6.2.3 The pooled standard deviation is:
$$ s_p = \sqrt{\frac{(6-1)\,0.31367 + (6-1)\,0.26267}{(6 + 6 - 2)}} = \sqrt{\frac{2.8817}{10}} = 0.537 \text{ mg/g} $$
with 10 degrees of freedom.
6.2.4 The difference of means is D = 98.27 – 97.62 = 0.65 mg/g. The plant laboratory average is 0.65 mg/g higher than the development laboratory average. The standard error of the difference of means is $s_D = 0.537\sqrt{2/6} = 0.310$ mg/g with 10 degrees of freedom (same as that for $s_p$).
6.2.5 The 95th percentile of Student’s t with 10 degrees of freedom is 1.812. Upper and lower confidence limits for the difference of means are:
UCL = 0.65 + (1.812)(0.310) = 1.21
LCL = 0.65 – (1.812)(0.310) = 0.09
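The calculations in 6.2.3 through 6.2.5 can be reproduced from the Table 1 data with a short script. This is a minimal sketch rather than a reference implementation: the function name `tost_two_samples` is hypothetical, the difference D is taken as Laboratory 2 minus Laboratory 1 to match 6.2.4, and the t multiplier is supplied as the tabulated value 1.812 quoted in 6.2.5 instead of being computed.

```python
from math import sqrt
from statistics import mean, variance

def tost_two_samples(x1, x2, E, t_quantile):
    """TOST for two independent samples with a pooled standard deviation.

    Returns (D, s_D, LCL, UCL, equivalent), with D = mean(x2) - mean(x1).
    t_quantile is the tabulated 100(1 - alpha)th percentile of Student's t
    with n1 + n2 - 2 degrees of freedom.
    """
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_p = sqrt(pooled_var)                 # pooled standard deviation
    s_D = s_p * sqrt(1.0 / n1 + 1.0 / n2)  # standard error of D, Eq 8
    D = mean(x2) - mean(x1)
    ucl = D + t_quantile * s_D             # Eq 9
    lcl = D - t_quantile * s_D             # Eq 10
    return D, s_D, lcl, ucl, (lcl > -E and ucl < E)

lab1 = [96.9, 97.9, 98.5, 97.5, 97.7, 97.2]  # Table 1, Laboratory 1
lab2 = [97.8, 97.6, 98.1, 98.6, 98.6, 98.9]  # Table 1, Laboratory 2

D, s_D, lcl, ucl, equivalent = tost_two_samples(lab1, lab2, E=2.0, t_quantile=1.812)
print(f"D = {D:.2f}, s_D = {s_D:.3f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}, "
      f"equivalent: {equivalent}")
```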
The 90 % two-sided confidence interval on the true difference is 0.09 to 1.21 mg/g and is completely contained within the equivalence interval of –2 to 2 mg/g. Since 0.09 > –2 and 1.21 < 2, equivalence is accepted.
7. The TOST Procedure for Statistical Analysis of Means Equivalence — Paired Samples Design
7.1 Statistical Analysis—Let the sample data be denoted as $X_{ij}$ = the test result from the ith population and the jth block, where i = 1 or 2. Each block represents a pair of single test results from each population. For example, the blocking factor may be time of sampling from a process. The equivalence limit E, consumer’s risk α, and sample size (number of blocks, symbol n) have been previously determined (see Section 5).
7.1.1 Calculate the n differences, symbol $d_j$, between the two test results within each block, the average of the differences, symbol $\bar{d}$, and the standard deviation of the differences, symbol $s_d$, with its degrees of freedom, symbol f.
$$ d_j = X_{1j} - X_{2j}, \quad j = 1, \dots, n \tag{11} $$
$$ \bar{d} = \frac{\sum_{j=1}^{n} d_j}{n} = D \tag{12} $$
$$ s_d = \sqrt{\frac{\sum_{j=1}^{n} (d_j - \bar{d})^2}{(n - 1)}} \tag{13} $$
$$ f = n - 1 \tag{14} $$
7.1.2 Calculate the standard error of the mean difference, symbol $s_D$.
$$ s_D = \frac{s_d}{\sqrt{n}} \tag{15} $$
7.1.3 Test for Equivalence—Compute the upper (UCL) and lower (LCL) confidence limits for the 100(1–2α) % two-sided confidence interval on the true difference. If the confidence interval is completely contained within the equivalence limits (0 ± E), or equivalently if LCL > –E and UCL < E, then accept equivalence. Otherwise, reject equivalence.
$$ UCL = D + t\,s_D \tag{16} $$
$$ LCL = D - t\,s_D \tag{17} $$
where t is the upper 100(1–α)th percentile of the Student’s t distribution with (n − 1) degrees of freedom.
7.2 Example for Means Equivalence—Total organic carbon (TOC) in purified water was measured by an on-line analyzer, wherein a water sample was taken directly into the analyzer from the pipeline through a sampling port and the test result was determined by a series of operations within the instrument. A new analyzer was to be qualified by running a TOC analysis at the same time as the current analyzer utilizing a parallel sampling port on the pipeline. The sampling time was the blocking factor, and the data from the two instruments constituted a pair of single test results measured at a particular sampling time. Sampling was to be conducted at intervals of four hours.
An equivalence limit of 2 parts per billion (ppb), or 4 % of the nominal process average of 50 ppb, was proposed with a consumer risk of 5 %. A repeatability estimate of $s_r$ = 0.7 ppb, based on previous validation work, gave an estimate for $\sigma_d = 0.7\sqrt{2}$ or approximately 1 ppb. Thus E = 2 ppb, α = 0.05, and $\sigma_d$ = 1 ppb were inputs for this study.
7.2.1 Sample Size Determination—Because the paired samples design uses the differences of the test results within sampling periods for data analysis, the sample size equals the number of pairs for purposes of calculating the power curve. In this example, the cost of obtaining test results was not a major consideration once the new analyzer was installed in the system. Comparative power profiles for n = 10, 20, and 50 sample pairs are shown in Fig. 2. The sample size of 20 pairs yielded a satisfactory power curve, in that the probability of accepting equivalence was greater than 0.9 (or 90 % power) for a true difference of about 1.25 ppb. Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support equivalence in the actual trial.
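The sample size choice in 7.2.1 can be approximated in a few lines. The sketch below is an assumption-laden illustration, not the calculation behind Fig. 2: it uses a normal approximation to the power with standard error $\sigma_d/\sqrt{n}$, takes the inputs E = 2 ppb, α = 0.05, and $\sigma_d$ = 1 ppb from 7.2, and picks the smallest candidate number of pairs whose power reaches 0.9 at a true difference of 1.25 ppb. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def paired_tost_power(delta, n_pairs, sigma_d, E, alpha=0.05):
    """Approximate power for the paired samples design.

    Assumption: normal approximation with known sigma_d; the power
    equations in the appendix of this practice may differ in detail.
    """
    nd = NormalDist()
    se = sigma_d / sqrt(n_pairs)       # standard error of the mean difference
    z = nd.inv_cdf(1.0 - alpha)        # Z_{1-alpha}
    d = abs(delta)
    p = nd.cdf((E - d) / se - z) + nd.cdf((E + d) / se - z) - 1.0
    return max(0.0, p)

def smallest_adequate_n(candidates, delta, sigma_d, E, alpha=0.05, target=0.9):
    """Smallest candidate n whose power at delta meets the target, else None."""
    for n in sorted(candidates):
        if paired_tost_power(delta, n, sigma_d, E, alpha) >= target:
            return n
    return None

# Inputs from 7.2: E = 2 ppb, sigma_d = 1 ppb, candidates n = 10, 20, 50.
print(smallest_adequate_n((10, 20, 50), delta=1.25, sigma_d=1.0, E=2.0))
```

Among the three candidate sizes, this approximation selects 20 pairs, in line with the choice made in 7.2.1.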
7.2.2 Test results for the two instruments at each of the 20 denoted as Xi = the ith test result. The format is similar to that
sampling times are listed in Table 2. The current analyzer was for the means equivalence example in Section 6, but the CRM
designated as Instrument A, and the new analyzer was desig- substitutes for the first population, and its ARV is treated as a
nated as Instrument B. The differences dj at each sampling time known constant. This assignment gives the correct sign for the
period were calculated and listed in Table 2 as differences in test method bias.
the test results of Instrument B minus Instrument A. The 8.1.1 The equivalence limit E, consumer’s risk α, and
averages and standard deviations of the test results for each sample sizes have been previously determined. Calculate the
analyzer and their differences are also listed in Table 2. average, estimated bias, standard deviation, and degrees of
7.2.2.1 The average difference d̄ was 0.46 ppb and the freedom:
standard deviation of the differences sd was 1.05 ppb with f = n
19 degrees of freedom. The standard error of the average (X i
i51
difference was: X̄ 5 (18)
n
1.05
sD 5 5 0.235 ppb B 5 X̄ 2 ARV (19)
=20
7.2.2.2 Note that the standard deviations of test results for
each analyzer over time were about 6 ppb due to process
s5 Œ( ~
n
i51
2
X i 2 X̄ ! ⁄ ~ n 2 1 ! (20)
fluctuations in a range of 37–59 ppb. The source of variation f 5 ~ n 2 1 ! degrees of freedom ~ d f ! (21)
due to blocks (sampling times from the process) is eliminated
8.1.2 Calculate the standard error of the bias:
in the variation of the differences by pairing the test results.
7.2.3 The 95th percentile of Student’s t with 19 degrees of sB 5 s ⁄ =n (22)
freedom was 1.729. Upper and lower confidence limits for the
difference of means were: 8.1.3 Test for Equivalence—Calculate upper and lower con-
fidence limits:
UCL 5 D1ts D 5 0.461 ~ 1.729!~ 0.235! 5 0.87 ppb
UCL 5 B1ts B (23)
LCL 5 D 2 ts D 5 0.46 2 ~ 1.729!~ 0.235! 5 0.05 ppb
LCL 5 B 2 ts B (24)
The 90 % two-sided confidence interval on the true differ-
ence is 0.05 to 0.87 ppb and is completely contained within the where t is the upper 100(1–α) percentile of the Student’s t
equivalence interval of –2 to 2 ppb. Since 0.05 > –2 and 0.87 distribution with (n1 – 1) degrees of freedom.
< 2, equivalence of the two analyzers is accepted. If the 100(1–2α) two-sided confidence interval on the true
difference is completely contained within the equivalence limits (0 ± E), equivalently if LCL > –E and UCL < E, equivalence is accepted. Otherwise, reject equivalence.

TABLE 2 Data for Paired Samples Equivalence Test
(TOC in Water, ppb)

Sampling Time    Inst A    Inst B    Diff
 1               46.4      48.8       2.4
 2               44.2      43.5      –0.7
 3               52.4      53.0       0.6
 4               37.6      37.3      –0.3
 5               49.3      49.1      –0.2
 6               45.0      44.5      –0.5
 7               51.4      51.3      –0.1
 8               57.6      56.8      –0.8
 9               43.4      44.9       1.5
10               45.2      44.1      –1.1
11               59.0      58.5      –0.5
12               43.1      44.1       1.0
13               39.3      40.9       1.6
14               48.2      48.4       0.2
15               48.7      49.0       0.3
16               44.4      46.1       1.7
17               52.7      53.2       0.5
18               43.3      44.6       1.3
19               54.4      56.7       2.3
20               58.4      58.4       0.0
Average          48.20     48.66      0.46
Std Dev           6.13      5.99      1.05

8. The TOST Procedure for Statistical Analysis of Bias Equivalence

8.1 Statistical Analysis—A number of tests are conducted on a certified reference material (CRM) in a laboratory. The average of the test results is compared with the accepted reference value (ARV) for that material. Let the data be

8.2 Example for Bias Equivalence—The accepted reference value for the test material was given as 49.50 % by weight (wt%). An estimate of the repeatability precision from the method development validation was 1.5 wt%. An equivalence limit of 3.0 wt% was selected, based on the specification range for that material, at 5 % consumer risk. Thus E = 3 wt%, α = 0.05, and estimated σ = 1.5 wt% are inputs for this study.

8.2.1 Sample Size Determination—Power profiles for n = 5, 12, and 30 were generated for a set of absolute difference values ranging from 0.00 to 4.00 wt% in steps of 0.25 wt%, as shown in Fig. 3. All three curves intersect at the point (3, 0.05) as determined by the consumer’s risk at the equivalence limit.

8.2.1.1 A sample size of 12 replicate assays yields a satisfactory power curve, in that the probability of accepting equivalence (power) was greater than 0.9 (90 % power) for a difference of 1.75 wt% or less. Therefore, there would be less than an estimated 10 % risk to the producer that such a difference would fail to support equivalence in the actual trial.

8.2.1.2 A comparison of the three power curves indicates that the n = 5 design would be underpowered, as the power falls below 0.9 at 1.0 wt%. The n = 30 design gives somewhat more power than the n = 12 design but is more costly to conduct and may not be worth the extra expenditure.
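A power profile of the kind described in 8.2.1 can be sketched with the normal approximation of power given in X1.6. The sketch below is illustrative only. The function name `equivalence_power` and the `design` argument are assumptions introduced here, and the single-sample standard error σ/√n is assumed for the bias case. Because it uses the normal approximation, its values need not match those read from Fig. 3.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def equivalence_power(delta, E, sigma, n, alpha=0.05, design=1):
    """Normal-approximation power of the TOST at a true difference `delta`.

    design=1: single sample against a reference value, sigma_D = sigma*sqrt(1/n)
              (assumed here for the bias equivalence case)
    design=2: two independent samples of equal size n, sigma_D = sigma*sqrt(2/n)
    """
    sigma_d = sigma * math.sqrt(design / n)
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)
    power = (_STD_NORMAL.cdf((E - delta) / sigma_d - z)
             - _STD_NORMAL.cdf((-E - delta) / sigma_d + z))
    # When n is too small the expression goes negative: no chance of acceptance.
    return max(0.0, power)


# Inputs of the bias example: E = 3 wt%, sigma = 1.5 wt%, alpha = 0.05
deltas = [0.25 * i for i in range(17)]  # 0.00, 0.25, ..., 4.00 wt%
profiles = {
    n: [equivalence_power(d, E=3.0, sigma=1.5, n=n) for d in deltas]
    for n in (5, 12, 30)
}

for i, d in enumerate(deltas):
    row = "  ".join(f"{profiles[n][i]:.3f}" for n in (5, 12, 30))
    print(f"{d:4.2f}  {row}")
```

Each printed row gives the true difference followed by the approximate power for n = 5, 12, and 30. At a true difference equal to E, every curve passes through a power of approximately α.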
8.2.2 Results for the twelve replicate assays are given in Table 3. The laboratory mean, the bias, the laboratory standard deviation, its degrees of freedom, and the standard error of the bias are:

$$\bar{X} = (48.5 + 51.0 + \cdots + 48.9)/12 = 50.49 \text{ wt\%}$$

$$B = 50.49 - 49.50 = 0.99 \text{ wt\%}$$

$$s = \sqrt{\left[(48.5 - 50.49)^2 + \cdots + (48.9 - 50.49)^2\right]/(12 - 1)} = 1.935 \text{ wt\%}$$

$$f = 12 - 1 = 11$$

$$s_B = 1.935/\sqrt{12} = 0.559 \text{ wt\%}$$

TABLE 3 Data for Bias Equivalence Test

Test Results
48.5   51.0   54.0   53.2   47.6   49.4
50.2   49.5   52.1   51.6   49.9   48.9

8.2.3 The 95th percentile of Student’s t with 11 degrees of freedom is 1.796. Upper and lower confidence limits are:
UCL = 0.99 + (1.796)(0.559) = 1.99 wt%
LCL = 0.99 – (1.796)(0.559) = –0.01 wt%

8.2.4 Since –0.01 > –3 and 1.99 < 3, equivalence is accepted.

9. Procedure for Statistical Analysis of Non-Inferiority Tests Involving Means and Variances

9.1 Statistical Analysis Involving Means—The calculations for non-inferiority tests are essentially the same as for means equivalence with the following exceptions.
(1) The means being compared are from values of a performance characteristic, not necessarily the test result means.
(2) The scale for a performance characteristic is directional, one direction denoting inferiority of the modified test procedure. Thus only a single one-sided test is conducted.

9.1.1 Depending on the experimental design that was used, calculate the upper (UCL) and lower (LCL) confidence limits on the difference between means. For the Two Independent Samples Design, use the calculations in 6.1. For the Paired Samples Design, use the calculations in 7.1.

9.1.2 For a performance characteristic where “higher is better”, accept noninferiority for the modified test procedure with respect to the current test procedure when LCL > –E; otherwise denote inferiority for the modified test procedure.

9.1.3 For a performance characteristic where “lower is better”, accept noninferiority for the modified test procedure with respect to the current test procedure when UCL < E; otherwise denote inferiority for the modified test procedure.

9.2 Example—Non-Inferiority Test for Sensitivity of Detection—Environmental testing for microbial contamination by the current (compendial) test method involves counting microbial colony-forming units (CFU) after plating and incubating the sample for a period of days. Newer rapid test methods give a result in shorter time and so have benefits in timeliness even though they might have slightly lower detection sensitivity than the compendial method. Therefore, the performance characteristic, sensitivity, is “higher is better” and the non-inferiority test is based on LCL > –E.

9.2.1 In this example the acceptance criterion is based on a ratio rather than on a difference. The industry standard USP <1223> stipulates that “The alternate method should provide an estimate of viable microorganisms not less than 70 % of the estimate provided by the traditional method …”, thus the noninferiority limit for the ratio of the CFU counts (rapid/compendial) would be –E = 0.7. For this situation, a logarithmic transformation gives a natural scale for this acceptance criterion in terms of a mean difference. Let X̄1 = the average count by the rapid method and let X̄2 = the average count by the compendial method. In the log metric, the log of the ratio is equal to the difference in the log means, thus $\log_{10}(\bar{X}_1/\bar{X}_2) = \log_{10}(\bar{X}_1) - \log_{10}(\bar{X}_2) = D$. Therefore, the equivalence limit –E is equal to $\log_{10}(0.7) = -0.1549$ in the log metric.
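The log-scale criterion of 9.2.1 can be illustrated with a short calculation. The sketch below uses the summary statistics reported for this example in Table 4 (nine results per method, on the log10 scale) and the Student's t percentile quoted there, rather than computing the percentile. The helper name `pooled_sd` and all variable names are assumptions introduced for illustration.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two samples and its degrees of freedom."""
    f = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / f)
    return sp, f


# Ratio limit of 0.7 (rapid/compendial) expressed on the log10 scale
neg_E = math.log10(0.7)

# Summary statistics on log10(CFU), as reported for this example
n_comp, mean_comp, sd_comp = 9, 1.7313, 0.0605     # compendial method
n_rapid, mean_rapid, sd_rapid = 9, 1.6989, 0.0620  # rapid method

sp, f = pooled_sd(sd_comp, n_comp, sd_rapid, n_rapid)
D = mean_rapid - mean_comp                      # difference, rapid minus compendial
se_D = sp * math.sqrt(1 / n_comp + 1 / n_rapid)

t_95 = 1.746  # 95th percentile of Student's t for f = 16, quoted from the example
lcl = D - t_95 * se_D

# "Higher is better": non-inferiority is supported when LCL > -E
non_inferior = lcl > neg_E
print(f"-E = {neg_E:.4f}, D = {D:.4f}, LCL = {lcl:.4f}, pass = {non_inferior}")
```

Because only rounded summary values are used as inputs, the last digit of intermediate results may differ slightly from the tabulated ones.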
9.2.2 Eighteen independent bioassays were conducted at the same time, nine each by the compendial and rapid test methods, sampling from a single microorganism suspension having approximately 50 CFU. This was a Two Independent Samples Design. The 6.1 calculations were made on the log-transformed count data using an equivalence limit of –E = –0.1549, and these calculations are summarized in Table 4. The average recovery by the rapid method was lower than the compendial method (50.4 CFU versus 54.3 CFU) with a ratio of 0.928, or a 7.2 % reduction. The lower confidence limit (LCL) on the log difference D was –0.0828, which was higher than the equivalence limit –0.1549, and thus non-inferiority was supported.

TABLE 4

                                      Count, CFU             Log10 Count
                                      Compendial   Rapid     Compendial   Rapid     Equation
n                                     9            9         9            9
Average                               54.3         50.4      1.7313       1.6989    Eq 1
Std Dev                               7.52         7.21      0.0605       0.0620    Eq 3
Degrees of Freedom, f                                        8            8         Eq 4
Pooled Standard Deviation                                    0.0612                 Eq 5
Degrees of Freedom                                           16                     Eq 6
Difference (Rapid-Comp)                                      –0.0325                Eq 7
Standard Error of Difference                                 0.0289                 Eq 8
95 % Confidence Limit:
Student’s t, f = 16, 95th Percentile                         1.746
Lower Confidence Limit                                       –0.0828                Eq 10
Equivalence Limit                                            –0.1549
Non-Inferiority Test                                         Pass

9.2.3 Note that the use of a normal distribution for log counts was justified in this situation because the count range is small. This was confirmed by a normality test on each source of nine log-transformed counts (not shown here).

9.2.4 Fig. 4 shows a post-facto power curve based on n = 9, α = 0.05, and σ = 0.06 log CFU. The curve intersects the point (–0.1549, 0.05), confirming that the power is 5 % at the given equivalence limit. Power is above 90 % at a log CFU ratio down to near –0.1 (about a 20 % reduction in sensitivity) for this design. This supports the sample size that was used for this non-inferiority test.

9.3 Statistical Analysis Involving Variances—The test for non-inferiority of precision is conducted using variances as the test statistic. Non-inferiority tests are used for variances because precision is a performance characteristic in which “smaller is better”. The statistical procedure is based on a one-sided F test. The proper design is the Two Independent Samples Design, so use the calculations in 6.1, Eq 2-4, for the variances.

9.3.1 Calculate the ratio R of the variances (modified/current) and its upper confidence limit, where $s_1^2$ is the variance estimate of the current procedure with f1 degrees of freedom and $s_2^2$ is the variance estimate of the modified procedure with f2 degrees of freedom:

$$R = s_2^2/s_1^2 \qquad (25)$$

The upper confidence limit for the true variance ratio $\sigma_2^2/\sigma_1^2$ is:

$$UCL_R = R\,F_{1-\alpha} \qquad (26)$$

where $F_{1-\alpha}$ is the upper 100(1–α)th percentile of the F distribution with f1 and f2 degrees of freedom (see X1.5.3).

9.3.2 Test for Non-Inferiority of Population 2 Precision—Because precision stated inversely as variance is a performance characteristic where “lower is better”, accept noninferiority for the modified test procedure with respect to the current test procedure when $UCL_R < E$; otherwise denote inferiority for the modified test procedure.

9.3.3 The needed sample sizes for variance tests will be much larger than those for means. It will usually be difficult, if not impossible, to generate 30 or more test results at the same time by each test method under repeatability conditions. This means that the tests will be conducted under intermediate precision conditions using a set of control samples that are homogeneous and stable. Fig. 5 shows power curves for equal sample sizes n = 31, 51, and 101 (degrees of freedom, f = 30, 50, and 100) with α = 0.05 and E = 4. A power of 0.8 can be attained at true variance ratios of 1.6, 2.0, and 2.4, respectively. Note that the three power profile curves intersect at the point (4, 0.05).

9.3.4 If a control sample is unavailable, an alternate design would be to run duplicate tests by each test method on a series of routine samples. Each duplicate will provide a one degree of freedom estimate of test variance under repeatability conditions. These variances can then be pooled to obtain a repeatability estimate for each test method.
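The variance calculations of 9.3.1, 9.3.2, and 9.3.4 can be sketched as follows. All data values here are hypothetical, and the function names are introduced for illustration only. The F percentile is passed in as a tabulated value, since the Python standard library does not provide F distribution quantiles. The value 1.84 used for the 95th percentile with 30 and 30 degrees of freedom is taken from standard F tables and should be checked against the table or software in use.

```python
def pooled_duplicate_variance(pairs):
    """Pool the one-degree-of-freedom variance estimates from duplicate results.

    Each pair (a, b) estimates the variance as (a - b)**2 / 2; pooling k pairs
    gives sum((a - b)**2) / (2k) with k degrees of freedom.
    """
    k = len(pairs)
    variance = sum((a - b) ** 2 for a, b in pairs) / (2 * k)
    return variance, k


def variance_noninferiority(var_current, var_modified, f_percentile, E):
    """Return R (Eq 25), UCL_R (Eq 26), and whether UCL_R < E."""
    R = var_modified / var_current
    ucl_R = R * f_percentile
    return R, ucl_R, ucl_R < E


# Hypothetical standard deviations from 31 results by each procedure (f = 30)
var_current = 1.00 ** 2
var_modified = 1.30 ** 2
F_95_30_30 = 1.84  # tabulated upper 95th percentile of F, 30 and 30 df

R, ucl_R, accepted = variance_noninferiority(
    var_current, var_modified, F_95_30_30, E=4.0
)
print(f"R = {R:.3f}, UCL_R = {ucl_R:.3f}, non-inferior = {accepted}")

# Hypothetical duplicate results on two routine samples by one test method
dup_var, dup_f = pooled_duplicate_variance([(10.0, 10.2), (9.8, 10.0)])
print(f"pooled variance = {dup_var:.4f} with {dup_f} degrees of freedom")
```

When the duplicate design of 9.3.4 is used, the pooled variances and their degrees of freedom for each method would take the place of `var_current` and `var_modified`, with the F percentile looked up for the corresponding degrees of freedom.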
10. Keywords
10.1 bias equivalence; confidence interval; equivalence;
equivalence limit; means equivalence; non-inferiority; two
one-sided tests (TOST) procedure
APPENDIXES
(Nonmandatory Information)
X1.1 Two One-Sided Tests (TOST) Procedure (1)

X1.1.1 Data from two populations (sources) are assumed to arise independently from normally distributed populations having distinct means, denoted as µ1, µ2, and a common standard deviation, denoted as σ. The TOST procedure sets up two null hypotheses (H0) and corresponding alternate hypotheses (Ha) on the difference between the two population means as follows:

                          Hypothesis 1           Hypothesis 2
Null hypothesis           H01: µ2 – µ1 ≥ E       H02: µ2 – µ1 ≤ –E
Alternative hypothesis    Ha1: µ2 – µ1 < E       Ha2: µ2 – µ1 > –E

The value E is termed the equivalence limit, representing the worst case difference between the two means.

X1.1.2 The TOST procedure is carried out using the data sampled from the two populations, as illustrated in 6.1 with an example. A one-sided t test at the α significance level tests each of the two null hypotheses.

$$D = \bar{X}_2 - \bar{X}_1 \quad \text{and} \quad s_D = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

with $f = (n_1 + n_2 - 2)$ degrees of freedom.

The t statistics are $t_1 = (E - D)/s_D$ and $t_2 = (E + D)/s_D$ for hypotheses 1 and 2, respectively. Both null hypotheses are rejected when $t_1 > t_{1-\alpha,f}$ and $t_2 > t_{1-\alpha,f}$, where $t_{1-\alpha,f}$ is the upper (1–α)th quantile of the Student’s t distribution with f degrees of freedom. If both hypotheses are rejected, then it is asserted that –E < µ1 – µ2 < E and the two sources are said to be equivalent; otherwise, the two data sources are deemed non-equivalent.

X1.1.3 The TOST procedure is operationally identical to constructing a two-sided 100(1–2α) % confidence interval on the difference between two means (2). If the confidence interval is completely contained within the interval (–E, E) then equivalence is accepted. The interval (–E, E) is termed the equivalence interval.

X1.1.4 It is strongly recommended (5) that the sample sizes from each population be equal to minimize the effect of a departure from equal population variances. If the variances differ greatly the standard error of the difference may be calculated as:

$$s_D = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad \text{(X1.1)}$$

With approximate degrees of freedom:

$$v = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} \qquad \text{(X1.2)}$$

In many statistical software packages this calculation is used in the option “assume unequal variances” for a t test. The resulting degrees of freedom are bounded between MIN(n1 – 1, n2 – 1) and n1 + n2 – 2.

X1.2 Decision Errors and Risks

X1.2.1 In any statistical hypothesis testing situation a decision is made to either accept or reject the null hypothesis based on outcome of the procedure. Since the data are subject to variation, this will create uncertainty in the final decision. There are two kinds of errors associated with the final decision:
(1) Rejecting the null hypothesis when it is true (Type I error), and
(2) Not rejecting the null hypothesis when it is false (Type II error).

X1.2.2 For the equivalence application of a hypothesis test, the null hypothesis is that the two populations are not equivalent, so the Type I error is declaring equivalence when the two populations are truly not equivalent. The Type I error is considered a consumer’s risk, since acceptance of a non-equivalent testing process will affect customers (patients, regulators, etc.) by creating erroneous test results in release of product and other quality management activities. This risk is set by choosing the significance level of the two hypothesis tests in the TOST procedure, so that the consumer’s risk is directly controlled.

X1.2.3 The Type II error is failing to declare equivalence when the two populations are truly equivalent. The Type II error is considered a producer’s risk, since this will create additional investigational work to make a desired improvement. This risk is controlled by choosing an adequate sample size to be taken from each population by consideration of power profiles from various sample sizes.

X1.2.4 The table below summarizes the four situations that may occur for a given TOST procedure.

                                  Populations are truly:
TOST declares that:               Equivalent             Not Equivalent
Populations are equivalent        Decision is correct    Type I Error
Populations are not equivalent    Type II Error          Decision is correct

X1.3 Criticism of the Use of the Conventional t Test for Equivalence Testing

X1.3.1 In the conventional two sample t test a single hypothesis test is set up as follows:

Null hypothesis           H0: µ1 – µ2 = 0
Alternative hypothesis    Ha: µ1 – µ2 ≠ 0

The null hypothesis is rejected if the two-sided confidence interval on the difference between the population means excludes zero and is not rejected if the confidence interval includes zero. If used for equivalence testing, equivalence would be rejected if the null hypothesis was rejected. This is
operationally the same as rejecting the null hypothesis if the two-sided confidence interval on the mean does not include zero.

X1.3.2 The Type I Error for the t test is the error of falsely declaring a non-zero difference, or the error of falsely declaring non-equivalence, which is the producer’s risk. As hypothesis tests are set up to directly control the Type I error (often at the 0.05 significance level) the conventional t test is not directly protecting the customer in the equivalence application. The consumer’s risk is indirectly controlled by the sample sizes selected.

X1.3.3 If the variances of the population means are small, either reflecting a precise test method, large sample sizes, or both, the confidence interval on the difference may not include zero, thus rejecting equivalence, even for small differences that are not of scientific importance. On the other hand, if the variances of the population means are large, the confidence interval on the difference may include zero, but may be extremely wide, thus masking critical differences. For these reasons, the conventional t test is not recommended for equivalence testing.

X1.4 Equivalence Testing for Bias

X1.4.1 The TOST procedure may also be used for bias equivalence testing. In this situation population mean µ1 is the accepted reference value (ARV) with zero variance. The experiment consists of comparing µ2 with the ARV. The population mean is re-designated as µ and the sample mean and variance calculated for the single data set are used for estimating the bias, µ – ARV, and its confidence limits for testing against the equivalence limit, or worst-case bias. The only change from the two population case is the calculation of the standard error and its degrees of freedom.

X1.5 Equivalence Testing for Non-Inferiority

X1.5.1 Non-inferiority in this practice compares a modified testing process to the current process with respect to a performance characteristic, where the acceptance criterion is stated in terms of a difference in means or a ratio of variances. The statistical procedure for non-inferiority testing uses a single one-sided hypothesis test where the null hypothesis states that the modified testing process is inferior to the current process. If the null hypothesis is rejected, the modified process is declared non-inferior to the current process for that performance characteristic.

X1.5.2 For performance characteristics comparing means, the hypothesis sets in X1.5.1 are used with µ1 defined as the mean of the current process and µ2 defined as the mean of the modified process. For an acceptance criterion where “lower is better” use Hypothesis 1, and for an acceptance criterion where “higher is better” use Hypothesis 2. The TOST procedure will supply the necessary one-sided hypothesis test calculations.

the form of the ratio $\sigma_2^2/\sigma_1^2$ that represents the worst case increase of variance.

X1.5.3.1 The statistical test involves the ratio $R = s_2^2/s_1^2$, which is the test statistic for the one-sided F test, using $s_1^2$ as the variance estimate of the current procedure with f1 degrees of freedom and $s_2^2$ as the variance estimate of the modified procedure with f2 degrees of freedom. The acceptance criterion for non-inferiority is $R\,F_{1-\alpha} < E$, where $F_{1-\alpha}$ is the upper 100(1–α)th percentile of the F distribution with f1 and f2 degrees of freedom.

X1.5.4 A reference for non-inferiority procedures is M. Rothmann, et al. (6). Although their context is directed to clinical trials for pharmaceuticals, many numerical examples are included, and these are easily translatable to test method evaluation.

X1.6 Power Profiles

X1.6.1 The power function of the means equivalence test has been examined (7, 8) where the emphasis is on finding a sample size n for a given value of the true difference in means. Power functions involving the non-central and central Student’s t distributions were considered, along with incorporating an upper confidence limit on the variance estimate with a normal distribution power function. The normal distribution approximation should be adequate when a strong estimate of σ is used (or the use of an upper confidence limit on σ if a more conservative estimate is desired). The normal approximation of power, given a true difference ∆, for equal sample sizes n is:

$$\text{Power} = \Phi\left(\frac{E - \Delta}{\sigma_D} - z_{1-\alpha}\right) - \Phi\left(\frac{-E - \Delta}{\sigma_D} + z_{1-\alpha}\right) \qquad \text{(X1.3)}$$

where:
Φ(•) = the standard normal cumulative distribution function,
∆ = µ1 – µ2, the true difference parameter,
$\sigma_D = \sigma\sqrt{2/n}$, the standard error of the test statistic D, and
$z_{1-\alpha}$ = the (1–α)th percentile of the standard normal distribution.

If the sample sizes are too small, the upper confidence limit on –E may exceed the lower confidence limit on E, and there will be a zero chance of accepting equivalence.

X1.6.2 The power function for the non-inferiority test for means depends on the direction of inferiority and uses the appropriate part of equation (Section X1.7).

For a performance characteristic where “higher is better” use:

$$\text{Power} = 1 - \Phi\left(\frac{-E - \Delta}{\sigma_D} + z_{1-\alpha}\right) \qquad \text{(X1.4)}$$

For a performance characteristic where “lower is better” use:

$$\text{Power} = \Phi\left(\frac{E - \Delta}{\sigma_D} - z_{1-\alpha}\right) \qquad \text{(X1.5)}$$
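The one-sided power functions of X1.6.2 can be evaluated with a few lines of code. The following is a minimal sketch under the normal approximation, for two independent samples of equal size n. The function name `noninferiority_power` is an assumption introduced here. The sketch also assumes that ∆ is taken as modified minus current, matching D = X̄2 – X̄1 in X1.1.2, and that the normal percentile is the upper (1–α)th percentile as in Eq X1.3. The inputs shown are those quoted for the post-facto power curve in 9.2.4.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def noninferiority_power(delta, E, sigma, n, alpha=0.05, higher_is_better=True):
    """Normal-approximation power of the one-sided non-inferiority test.

    delta : true difference in means, modified minus current (assumed convention)
    E     : non-inferiority limit, entered as a positive number
    sigma : standard deviation of a single test result
    n     : number of results from each of the two independent samples
    """
    sigma_d = sigma * math.sqrt(2.0 / n)
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)
    if higher_is_better:
        # Probability that LCL > -E
        return 1.0 - _STD_NORMAL.cdf((-E - delta) / sigma_d + z)
    # "Lower is better": probability that UCL < E
    return _STD_NORMAL.cdf((E - delta) / sigma_d - z)


# Inputs quoted in 9.2.4: n = 9, alpha = 0.05, sigma = 0.06 log CFU, -E = -0.1549
E = 0.1549
power_at_limit = noninferiority_power(-E, E, sigma=0.06, n=9)
power_at_zero = noninferiority_power(0.0, E, sigma=0.06, n=9)
print(f"power at the limit = {power_at_limit:.3f}")
print(f"power at no difference = {power_at_zero:.4f}")
```

At a true difference equal to the limit, the power equals α by construction, which is the anchoring point of the power curve. Values at intermediate differences depend on the approximation used and need not agree exactly with a plotted curve based on the t distribution.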
where:
ℱ(•) = the cumulative F distribution function with f1 and f2 degrees of freedom,
E = the equivalence limit expressed as the hypothesized ratio $\sigma_2^2/\sigma_1^2$, and
$F_{1-\alpha}$ = the upper 100(1–α)th percentile of the F distribution with f1 and f2 degrees of freedom.

X1.7 Alternative Designs

X1.7.1 Designs conducted using intermediate precision conditions may involve other sources of variation, thus making the analysis more complicated and possibly raising side issues, such as differences among operators or instruments within laboratories (9, 10).

X2.1 Power Profile for Means or Bias Equivalence Using a Single Sample or Two Independent Samples of Equal Sample Size

X2.1.1 Data Entry—A spreadsheet example for generating power profiles is shown in Fig. X2.1. See Section X1.6 for background information. Five input variables are entered into cells B3–B7 as follows:
• In B3, enter the estimate of the standard deviation of the test results, σ
• In B4, enter the consumer risk, α
• In B5, enter 1 for a single sample design or 2 for a two independent samples design
• In B6, enter the equivalence limit, E
• In B7, enter the sample size, n
• In A10, downward enter a range of true differences starting with zero and exceeding the equivalence limit, E, and adjust the horizontal axis of the graph accordingly.

X2.1.2 Calculation—Cells E3 and E4 list results for intermediate calculations of Zα and σD. The power for a given true difference is calculated from E, ∆, Zα, and σD, and the function equation for this appears in Row 24. The calculated power curve values will appear in B10 downward.

X2.1.3 Graph—The graph plots the power on the vertical axis versus the absolute true difference on the horizontal axis. The curve is anchored at the point (E, α). For different ranges for the true difference the axes may have to be altered by the user.

X2.2 Disclaimer—This spreadsheet example is not supported by ASTM, and the user of this standard is responsible for its use. For questions pertaining to use of this spreadsheet example please contact Subcommittee E11.20.
REFERENCES
(1) Schuirmann, D. J., “A Comparison of the Two One-sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability,” Journal of Pharmacokinetics and Biopharmaceutics, Vol 15, 1987, pp. 657–680.
(2) Westlake, W. J., “Response to T. B. L. Kirkwood: Bioequivalence Testing – A Need to Rethink,” Biometrics, Vol 37, 1981, pp. 589–594.
(3) Limentani, G. B., Ringo, M. C., Ye, F., Bergquist, M. L., and McSorley, E. O., “Beyond the t-Test: Statistical Equivalence Testing,” Analytical Chemistry, June 1, 2005, pp. 221A–226A.
(4) Chambers, D., Kelly, G., Limentani, G., Lister, A., Lung, K. R., and Warner, E., “Analytical Method Equivalency – An Acceptable Analytical Practice,” Pharmaceutical Technology, September 2005, pp. 64–80.
(5) Welch, B. L., “The Significance of the Difference Between Two Means When the Population Variances are Unequal,” Biometrika, Vol 29, 1938, pp. 350–362.
(6) Rothmann, M. D., Wiens, B. L., and Chan, S. F. I., Design and Analysis of Non-Inferiority Trials, Chapman & Hall/CRC, Taylor & Francis Group, Boca Raton, FL, 2012.
(7) Bristol, D. R., “Probabilities and Sample Sizes for the Two One-sided Tests Procedure,” Communications in Statistics – Theory Methods, Vol 22, 1993, pp. 1953–1961.
(8) Stein, J., and Doganaksoy, N., “Sample Size Considerations for Assessing the Equivalence of Two Process Means,” Quality Engineering, Vol 12, No. 1, 1999, pp. 105–110.
(9) Kringle, R., Khan-Malek, R., Snikeris, F., Munden, P., Agut, C., and Bauer, M., “A Unified Approach for Design and Analysis of Transfer Studies for Analytical Methods,” Drug Information Journal, Vol 35, 2001, pp. 1271–1288.
(10) Schwenke, J., and O’Connor, D., “Design and Analysis of Analytical Method Transfer Studies,” Journal of Pharmaceutical and BioSciences, Vol 18, No. 5, 2008, pp. 1013–1033.
ASTM International takes no position respecting the validity of any patent rights asserted in connection with any item mentioned
in this standard. Users of this standard are expressly advised that determination of the validity of any such patent rights, and the risk
of infringement of such rights, are entirely their own responsibility.
This standard is subject to revision at any time by the responsible technical committee and must be reviewed every five years and
if not revised, either reapproved or withdrawn. Your comments are invited either for revision of this standard or for additional standards
and should be addressed to ASTM International Headquarters. Your comments will receive careful consideration at a meeting of the
responsible technical committee, which you may attend. If you feel that your comments have not received a fair hearing you should
make your views known to the ASTM Committee on Standards, at the address shown below.
This standard is copyrighted by ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959,
United States. Individual reprints (single or multiple copies) of this standard may be obtained by contacting ASTM at the above
address or at 610-832-9585 (phone), 610-832-9555 (fax), or service@astm.org (e-mail); or through the ASTM website
(www.astm.org). Permission rights to photocopy the standard may also be secured from the Copyright Clearance Center, 222
Rosewood Drive, Danvers, MA 01923, Tel: (978) 646-2600; http://www.copyright.com/