p̄ = (x1 + x2)/(n1 + n2) (pooled sample proportion), q̄ = 1 − p̄. Requirements for inferences about two proportions: 1. The sample proportions are from two simple random samples. 2. The two samples are independent. (Samples are independent if the sample values selected from one population are not related to or somehow naturally paired or matched with the sample values from the other population.) 3. For each of the two samples, there are at least 5 successes and at least 5 failures. (That is, np̂ ≥ 5 and nq̂ ≥ 5 for each of the two samples.) If H0: p1 = p2 is assumed true, the pooled proportion p̄ is used in place of p̂1 and p̂2 in the standard error, and the test statistic is z = (p̂1 − p̂2)/√(p̄q̄/n1 + p̄q̄/n2).
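As a quick numerical check of the pooled two-proportion z test above, here is a minimal Python sketch using scipy; the counts x1, n1, x2, n2 are made-up illustrative values, not data from the text.

import numpy as np
from scipy import stats

x1, n1 = 60, 200   # hypothetical successes and sample size for sample 1
x2, n2 = 45, 190   # hypothetical successes and sample size for sample 2

p1_hat, p2_hat = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)            # pooled sample proportion
q_bar = 1 - p_bar

# Test statistic for H0: p1 = p2, using the pooled standard error
z = (p1_hat - p2_hat) / np.sqrt(p_bar * q_bar / n1 + p_bar * q_bar / n2)
p_value = 2 * stats.norm.sf(abs(z))      # two-tailed P-value
print(z, p_value)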
Round the confidence interval limits to three significant digits. The confidence interval estimate of the difference p1 − p2 is (p̂1 − p̂2) − E < (p1 − p2) < (p̂1 − p̂2) + E, where the margin of error is E = z_(α/2)·√(p̂1q̂1/n1 + p̂2q̂2/n2). The form of the confidence interval requires an expression for the variance different from the pooled one used in the hypothesis test: each sample proportion estimates its own variance, so the variances are not pooled.
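A small Python sketch of this unpooled confidence interval (again with made-up counts; the rounding shown is to three decimal places rather than true significant digits):

import numpy as np
from scipy import stats

x1, n1 = 60, 200   # hypothetical counts
x2, n2 = 45, 190
alpha = 0.05

p1_hat, p2_hat = x1 / n1, x2 / n2
q1_hat, q2_hat = 1 - p1_hat, 1 - p2_hat

z_crit = stats.norm.ppf(1 - alpha / 2)                       # z_(alpha/2)
E = z_crit * np.sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)
diff = p1_hat - p2_hat
print(round(diff - E, 3), round(diff + E, 3))                # CI limits for p1 - p2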
matched with the sample values from the other population. Two samples are dependent (or consist of matched pairs) if the sample values are
somehow matched, where the matching is based on some inherent relation-ship. (That is, each pair of sample values consists of two measurements
from the same subject—such as before>after data—or each pair of sample values consists of matched
pairs—such as husband>wife data—where the matching is based on some meaningful relationship.) Caution:
“Dependence” does not require a direct cause>effect relationship. If the two samples have different sample sizes with no missing data, they must be
independent. If the two samples have the same sample size, the samples may or may not be independent. Inferences About Two Means: Independent Samples
Requirements: 1. The values of σ1 and σ2 are unknown and we do not assume that they are equal. 2. The two samples are independent. 3. Both samples are simple random samples. 4. Either or both of these conditions are satisfied: the two sample sizes are both large (n1 > 30 and n2 > 30) or both samples come from populations having normal distributions. (The methods used here are robust against departures from normality, so for small samples the normality requirement is loose in the sense that the procedures perform well as long as there are no outliers and departures from normality are not too extreme.) The confidence interval estimate of the difference μ1 − μ2 is (x̄1 − x̄2) − E < (μ1 − μ2) < (x̄1 − x̄2) + E, where E = t_(α/2)·√(s^2_1/n1 + s^2_2/n2) and the number of degrees of freedom is found from technology (Welch's formula) or taken conservatively as the smaller of n1 − 1 and n2 − 1.
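A minimal Python sketch of this method (Welch-style t procedures, variances not pooled) using scipy; the two samples below are made-up illustrative values:

import numpy as np
from scipy import stats

sample1 = np.array([98.2, 97.9, 98.6, 98.1, 97.8, 98.4])   # hypothetical sample 1
sample2 = np.array([98.9, 98.5, 99.0, 98.7, 98.8, 98.6])   # hypothetical sample 2

# Hypothesis test of mu1 = mu2 without assuming equal variances
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
print(t_stat, p_value)

# Confidence interval for mu1 - mu2 with the conservative df = min(n1, n2) - 1
n1, n2 = len(sample1), len(sample2)
se = np.sqrt(sample1.var(ddof=1) / n1 + sample2.var(ddof=1) / n2)
t_crit = stats.t.ppf(0.975, df=min(n1, n2) - 1)
diff = sample1.mean() - sample2.mean()
print(diff - t_crit * se, diff + t_crit * se)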
Alternative methods: 1. The two population standard deviations are unknown but are assumed to be equal (σ1 = σ2), so the sample variances are pooled. 2. The two population standard deviations are both known. If we use randomness to assign subjects to treatment and placebo groups, we know that the samples are drawn from the same population, so if we conduct a hypothesis test assuming that the two population means are equal, it is reasonable to assume that the samples are from populations with the same standard deviation (but we should still check that assumption). The advantage of this alternative method of pooling sample variances is that the number of degrees of freedom is a little higher, so hypothesis tests have more power and confidence intervals are a little narrower.
If σ1 is known but σ2 is unknown, use the procedures in Part 1 of this section with these changes: replace s1 with the known value of σ1 and use the number of degrees of freedom found from the expression below. Matched pairs: this section presents methods for testing hypotheses and constructing confidence intervals involving the mean of the differences of the values from two populations that consist of matched pairs. The pairs must be matched according to some relationship, such as these: ■Before/after measurements from the same subjects ■IQ scores of husbands and wives ■Measured and reported weights from a sample
of subjects. Many experiments have been conducted to test the effectiveness of drug treatments in lowering blood pressure. When designing such experiments to test the effectiveness of a treatment, there are different
approaches that could be taken, such as these: 1. Measure the blood pressure of each subject before and after the treatment, then analyze the “before – after” differences. 2. For the entire sample of subjects, find the mean
blood pressure before the treatment and then find the mean after the treatment. 3. Obtain a random sample of subjects and use randomness to separate them into one sample given the treatment and another sample given a placebo. An advantage of using the matched pairs from the first approach is that we reduce the extraneous variation, which could easily occur with different independent samples. The strategy for designing an experiment can be generalized by the following principle of good design: When designing an experiment or planning an observational study, using matched pairs is generally better than using two independent samples. Déjà Vu All Over Again: The methods of hypothesis testing in this section are the same methods for testing a claim about a population mean, except that here we use the differences from the matched pairs of sample data. There are no exact
procedures for dealing with matched pairs, but the following approximation methods are commonly used. Requirements: 1. The sample data are matched pairs. 2. The matched pairs are a simple random sample. 3. Either or both of these conditions are satisfied: the number of pairs of sample data is large (n > 30) or the pairs of values have differences that are from a population having a distribution that is approximately normal. These methods are robust against departures from normality, so the normality requirement is loose. Notation: d = individual difference between the two values in a single matched pair; μ_d = mean value of the differences d for the population of all matched pairs of data; d̄ = mean value of the differences d for the paired sample data; s_d = standard deviation of the differences d for the paired sample data; n = number of pairs of sample data. The confidence interval estimate is d̄ − E < μ_d < d̄ + E, where E = t_(α/2)·s_d/√n with n − 1 degrees of freedom. Inferences with
matched pairs: 1. Verify that the sample data consist of matched pairs, and verify that the requirements in the preceding Key Elements box are satisfied. 2. Find the difference d for each pair of sample values. (Caution: Be sure to subtract in a consistent manner, such as "before – after".) 3. Find the value of d̄ (mean of the differences) and s_d (standard deviation of the differences). 4. For hypothesis tests and confidence intervals, use the same t test procedures used for a single population mean, applied to the differences.
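A small Python sketch of the matched-pairs t test and confidence interval (the before/after values are made-up illustrative numbers):

import numpy as np
from scipy import stats

before = np.array([142, 130, 155, 148, 137, 150])   # hypothetical "before" values
after  = np.array([138, 128, 150, 146, 135, 144])   # hypothetical "after" values

d = before - after                       # subtract in a consistent order
d_bar, s_d, n = d.mean(), d.std(ddof=1), len(d)

# Paired t test of H0: mu_d = 0 (equivalent to a one-sample t test on the differences)
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)

# Confidence interval d_bar +/- E with df = n - 1
E = stats.t.ppf(0.975, df=n - 1) * s_d / np.sqrt(n)
print(d_bar - E, d_bar + E)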
The F test uses the F distribution introduced in this section, and it requires that both populations have normal distributions. Instead of being robust, this test is very sensitive to departures from normal distributions, so the normality requirement is quite strict. Requirements: 1. The two populations are independent. 2. The two samples are simple random samples. 3. Each of the two populations must be normally distributed, regardless of their sample sizes. This F test is not robust against departures from normality, so it performs poorly if one or both of the populations have a distribution that is not normal. The requirement of normal distributions is quite strict for this F test. Test statistic: F = s^2_1/s^2_2, where s^2_1 is the larger of the two sample variances. P-Values: P-values are automatically provided by technology. If technology is not available, use
the computed value of the F test statistic with Table A-5 to find a range for the P-value. Critical Values: Use Table A-5 to find critical F values that are determined by the following: 1. The significance level α (Table A-5 includes critical values for α = 0.025 and α = 0.05.) 2. Numerator degrees of freedom = n1 − 1 (determines the column of Table A-5). 3. Denominator degrees of freedom = n2 − 1 (determines the row of Table A-5). For significance level α = 0.05, refer to Table A-5 and use the right-tail area of 0.025 or 0.05, depending on the type of test, as shown below: • Two-tailed test: use Table A-5 with 0.025 in the right tail. (The significance level of 0.05 is divided between the two tails, so the area in the right tail is 0.025.) • One-tailed test: use Table A-5 with α = 0.05 in the right tail. For two normally
distributed populations with equal variances σ^2_1 = σ^2_2, the sampling distribution of the test statistic F = s^2_1/s^2_2 is the F distribution shown in Figure 9-4 (provided that we have not yet imposed the stipulation that the larger sample variance is s^2_1). If you repeat the process of selecting samples from two normally distributed populations with equal variances, the distribution of the ratio s^2_1/s^2_2 is the F distribution. If the two populations really do have equal variances, then the ratio s^2_1/s^2_2 will tend to be close to 1. Because we are stipulating that s^2_1 is the larger sample variance, the ratio s^2_1/s^2_2 will be a large number whenever s^2_1 and s^2_2 are far apart in value. Consequently, a value of F near 1 is evidence in favor of σ^2_1 = σ^2_2, but a large value of F is evidence against σ^2_1 = σ^2_2. Large values of F are evidence against σ^2_1 = σ^2_2.
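A Python sketch of this F test for two variances (the samples are made-up illustrative values); the larger sample variance is placed in the numerator, and the two-tailed P-value doubles the right-tail area:

import numpy as np
from scipy import stats

sample1 = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3])   # hypothetical sample 1
sample2 = np.array([11.9, 12.0, 12.1, 11.8, 12.2, 12.0])   # hypothetical sample 2

v1, v2 = sample1.var(ddof=1), sample2.var(ddof=1)
# Stipulate that the numerator holds the larger sample variance
if v1 >= v2:
    F, df1, df2 = v1 / v2, len(sample1) - 1, len(sample2) - 1
else:
    F, df1, df2 = v2 / v1, len(sample2) - 1, len(sample1) - 1

right_tail = stats.f.sf(F, df1, df2)        # area to the right of the F statistic
p_value = min(2 * right_tail, 1.0)          # two-tailed test of sigma1^2 = sigma2^2
crit = stats.f.ppf(0.975, df1, df2)         # critical value for a two-tailed test at alpha = 0.05
print(F, p_value, crit)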
The count five method is a relatively simple alternative to the F test, and it does not require normally distributed populations. If the two sample sizes are equal, and if one sample has at least five of the largest mean absolute
deviations (MAD), then we conclude that its population has a larger variance. The Levene-Brown-Forsythe test (or modified Levene’s test) is another alternative to the F test, and it is much more robust against departures from
normality. This test begins with a transformation of each set of sample values: within the first sample, replace each x value with |x − median|, and apply the same transformation to the second sample. Using the transformed values, conduct a t test of equality of means for independent samples, as described in Part 1 of Section 9-2. Because the transformed values are now absolute deviations, the t test for equality of means is actually a test comparing variation in the two samples. See Exercise 18 “Levene-Brown-Forsythe Test.”
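scipy implements this directly: stats.levene with center='median' is the Brown-Forsythe version. A minimal sketch with made-up samples:

from scipy import stats

sample1 = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]   # hypothetical sample 1
sample2 = [11.9, 14.0, 12.1, 11.8, 15.2, 12.0]   # hypothetical sample 2

# center='median' gives the Brown-Forsythe (modified Levene) test,
# which is much more robust to non-normality than the F test
stat, p_value = stats.levene(sample1, sample2, center='median')
print(stat, p_value)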
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable. A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. Because it is always wise to explore sample data
before applying a formal statistical procedure, we should use a scatterplot to graph the paired data in Table 10-1 and observe whether there is a distinct pattern in the plotted points. (Scatterplots were first introduced in Section 2-4.) The scatterplot is shown in Figure 10-1, and there does appear to be a distinct pattern of increasing Powerball ticket sales corresponding to increasing jackpot amounts. There do not appear to be any outliers, which are data points that are far away from the other data points. Because conclusions based on visual examinations of scatterplots are largely subjective, we need more objective measures. We use the linear correlation coefficient r, which
is a number that measures the strength of the linear association between the two variables. The linear correlation coefficient r measures the strength of the linear correlation between the paired quantitative x values and y
values in a sample. The linear correlation coefficient r is computed by using Formula 10-1 or Formula 10-2, included in the following Key Elements box. [The linear correlation coefficient is sometimes referred to as the Pearson product moment correlation coefficient.] Given any collection of sample paired quantitative data, the linear correlation coefficient r can always be computed, but the following requirements should be satisfied when using the sample paired data to make a conclusion about linear correlation in the corresponding population of paired data: 1. The sample of paired (x, y) data is a simple random sample of quantitative data. (It is important that the sample data have not been collected using some inappropriate method, such as using a voluntary response sample.) 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. Because results can be strongly affected by the presence of outliers, any outliers must be removed if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the
outliers included. P-value ≤ α: supports the claim of a linear correlation. P-value > α: does not support the claim of a linear correlation. Correlation: If the computed linear correlation coefficient r lies in the left tail at or below the leftmost critical value or if it lies in the right tail at or above the rightmost critical value (that is, |r| ≥ critical value), conclude that there is sufficient evidence to support the claim of a linear correlation. No Correlation: If the computed linear correlation coefficient lies between the two critical values (that is, |r| < critical value), conclude that there is not sufficient evidence to support the claim of a linear correlation.
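A minimal Python sketch that computes r and its P-value for making this decision (the paired data are made-up illustrative values, not the data of Table 10-1):

import numpy as np
from scipy import stats

x = np.array([2.0, 3.5, 4.1, 5.0, 6.2, 7.8, 8.5])   # hypothetical x values
y = np.array([1.1, 2.0, 2.3, 3.1, 3.2, 4.4, 4.9])   # hypothetical paired y values

r, p_value = stats.pearsonr(x, y)   # linear correlation coefficient and its two-tailed P-value
print(r, p_value)
print(r**2)                         # proportion of the variation in y explained by the linear relationship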
Properties of the Linear Correlation Coefficient r: 1. The value of r is always between −1 and 1 inclusive. That is, −1 ≤ r ≤ 1. 2. If all values of either variable are converted to a different scale, the value of r does not change. 3. The value of r is not affected by the choice of x or y. Interchange all x values and y values, and the value of r will not change. 4. r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear. 5. r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value. Using the P-value from technology to interpret r: use the P-value and significance level α as follows: P-value ≤ α:
supports the claim of a linear correlation. P-value > α: does not support the claim of a linear correlation. A spurious correlation is a correlation between two variables that does not reflect an actual association between them. The value of r^2 is the proportion of the variation in y that is explained by the linear relationship between x and y. Correlation does not imply causality! Common Errors Involving Correlation: 1. Assuming that correlation implies causality. 2. Using data based on averages. 3. Ignoring the possibility of a nonlinear relationship. Positive Correlation: A large positive value of Σ(z_x·z_y) suggests that the points are predominantly in the first and third quadrants. Negative Correlation: A large negative value of Σ(z_x·z_y) suggests that the points are predominantly in the second and fourth quadrants. No Correlation: A value of Σ(z_x·z_y) near 0 suggests that the points are scattered among
the four quadrants (with no linear correlation). Given a collection of paired sample data, the regression line (or line of best fit, or least-squares line) is the straight line that “best” fits the scatterplot of the data. (The specific
criterion for the “best-fitting” straight line is the “least-squares” property described later.) The regression equation ŷ = b0 + b1x algebraically describes the regression line. The regression equation expresses a relationship between x (called the explanatory variable, or predictor variable, or independent variable) and ŷ (called the response variable or dependent variable). The slope and intercept are b1 = r·(s_y/s_x) and b0 = ȳ − b1·x̄. Round b1 and b0 to three significant digits.
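A Python sketch of fitting the least-squares line and computing residuals with scipy's linregress (the paired data are made-up illustrative values):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical predictor values
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])    # hypothetical response values

res = stats.linregress(x, y)                     # least-squares line y_hat = b0 + b1*x
b1, b0 = res.slope, res.intercept
print(round(b1, 3), round(b0, 3))

# Same coefficients from the formulas b1 = r*(s_y/s_x) and b0 = ybar - b1*xbar
r = np.corrcoef(x, y)[0, 1]
b1_check = r * y.std(ddof=1) / x.std(ddof=1)
b0_check = y.mean() - b1_check * x.mean()

y_hat = b0 + b1 * x
residuals = y - y_hat                            # residual = observed y - predicted y
print(residuals)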
MAKING PREDICTIONS. Good Model: Use the regression equation for predictions only if the graph of the regression line on the scatterplot confirms that the regression line fits the points reasonably well. Correlation: Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables. Scope: Use the regression line for predictions only if the data do not go much beyond the scope of the available sample data. (Predicting too far beyond the scope of the available sample data is called extrapolation, and it could result in bad predictions.) Bad Model: If the regression equation does not appear to be useful for making predictions, don't use it for making predictions. In working with two variables related by a regression equation, the marginal change in a variable
is the amount that it changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit. In a scatterplot, an
outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line. For a pair of sample x and y
values, the residual is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is, residual = observed y − predicted y = y − ŷ. A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible. A residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y − ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y − ŷ). A residual plot helps us determine whether the regression line is a good model of the sample data. Usefulness of a residual plot: ■A residual plot helps us to check the requirement that for different values of x, the corresponding y values all have the same standard deviation. A contingency table (or two-way frequency table) is a table consisting of frequency counts of categorical data corresponding to
two different variables. In a test of independence, we test the null hypothesis that in a contingency table, the row and column variables are independent. The expected frequency for a cell is E = (row total)·(column total)/(grand total). A chi-square test of homogeneity is a test of the claim that different populations have the same proportions of some characteristics.
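A short Python sketch of a chi-square test of independence on a contingency table (the observed counts are made-up illustrative values); scipy also returns the table of expected frequencies computed as (row total)(column total)/(grand total):

import numpy as np
from scipy import stats

# Hypothetical contingency table of observed frequency counts
observed = np.array([[30, 20, 10],
                     [25, 35, 30]])

chi2, p_value, df, expected = stats.chi2_contingency(observed)
print(chi2, p_value, df)   # df = (r - 1)(c - 1)
print(expected)            # expected counts under the null hypothesis of independence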
The analysis of variance (ANOVA) methods of this chapter require the F distribution, which has the following properties: 1. There is a different F distribution for each different pair of degrees of freedom for numerator and denominator. 2. The F distribution is not symmetric; it is skewed to the right. 3. Values of the F distribution cannot be negative. 4. The exact shape of the F distribution depends on the two different degrees of freedom. With equal sample sizes n in each group, the ANOVA test statistic is F = (variance between samples)/(variance within samples) = n·s^2_x̄/s^2_p, where s^2_x̄ is the variance of the sample means and s^2_p is the pooled sample variance. One-way analysis of variance (ANOVA) is a method of testing the equality of three or more population means by
analyzing sample variances. One-way analysis of variance is used with data categorized with one factor (or treatment), so there is one characteristic used to separate the sample data into the different
categories. Degrees of freedom: for a contingency table, df = (r − 1)(c − 1); for one-way ANOVA with k = number of samples and N = total number of sample values, the between-samples (numerator) df = k − 1 and the within-samples (denominator) df = N − k. MS denotes a mean square, which is a variance: s^2_x̄ is obtained from the sample means (the variance of the AVERAGEs), and s^2_p is obtained by pooling the sample variances (from the sample STDEVs). One way to reduce the effect of the extraneous factors is to design the experiment so that it has a completely randomized design, in which each sample value is given the same chance of belonging to the different factor
groups. Another way to reduce the effect of extraneous factors is to use a rigorously controlled design, in which sample values are carefully chosen so that all other factors have no variability. Some of the tests, called range
tests, allow us to identify subsets of means that are not significantly different from each other. Other tests, called multiple comparison tests, use pairs of means, but they make adjustments to overcome the problem of having a
significance level that increases as the number of individual tests increases. There is no consensus on which test is best, but some of the more common tests are the Duncan test, Student-Newman-Keuls test (or SNK test),
Tukey test (or Tukey honestly significant difference test), Scheffé test, Dunnett test, least significant difference test, and the Bonferroni test.
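A minimal Python sketch of the one-way ANOVA F test described above (the three groups are made-up illustrative values); the pairwise follow-up procedures such as the Tukey or Bonferroni tests are provided by other packages and are not shown here:

from scipy import stats

group1 = [5.2, 4.9, 5.5, 5.1, 5.0]   # hypothetical sample 1
group2 = [5.8, 6.1, 5.9, 6.3, 6.0]   # hypothetical sample 2
group3 = [5.3, 5.6, 5.4, 5.2, 5.5]   # hypothetical sample 3

# One-way ANOVA: H0 is that the three population means are all equal
F, p_value = stats.f_oneway(group1, group2, group3)
print(F, p_value)

# Degrees of freedom: between = k - 1, within = N - k
k = 3
N = len(group1) + len(group2) + len(group3)
print(k - 1, N - k)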
Wording the final decision: If the original claim contains equality (so the claim is the null hypothesis) and we reject H0, conclude "there is sufficient evidence to warrant rejection of the claim that ..."; if we fail to reject H0, conclude "there is not sufficient evidence to warrant rejection of the claim that ...". If the original claim does not contain equality (so the claim is the alternative hypothesis) and we reject H0, conclude "there is sufficient evidence to support the claim that ..."; if we fail to reject H0, conclude "there is not sufficient evidence to support the claim that ...".