Study Guide
1) Dependent vs. Independent Variables
   a) The independent variable is said to be the cause, the dependent variable is said to
       be the effect.
   b) EXAMPLE: Let’s assume we are trying to predict one’s weight given one’s
       diet. Clearly dietary considerations are the cause and how heavy you are is the
       result because size generally does not dictate diet. So, if I only eat 3 apples per
       day, my weight will be much less than if I eat 20 Big Macs per day.
   c) Clearly, the independent variable precedes the dependent variable in time.
       i) Note: oftentimes, it is not clear cut which variable comes before another. For
           example, number of traffic accidents and seatbelt law. One is inclined to
           believe that the dependent variable is number of traffic accidents, and
           enactment of the seatbelt law acts to decrease them. However, it is very
           possible that the high number of accidents has actually led to the enactment of
           the seatbelt law in a particular state. Whatever the case, it is clear that in a
           causal relationship, the variable that causes an effect comes first in temporal
           order.
2) Discrete vs. Continuous variables
   a) A discrete variable is one that takes on a countable number of distinct values,
       whereas a continuous variable is one that can theoretically take on any value
       within an interval, and therefore infinitely many values.
       i) Note: anything expressed as a percentage or proportion can take on an
           infinite number of values, since the range is anything in the closed interval
           [0,1].
   b) EXAMPLES: age, height, weight, time are continuous. Gender, number of
       children, number of sex partners are all discrete.
       i) Note carefully, in practice, most everything is recorded as a discrete variable.
           For example, if you are collecting data on “age”, you only record whole
           numbers in the database, and not 12.1 years, for example.
          (1) Exam strategy: if you are unsure of the level of measurement, your best
              guess is that it is discrete.
      ii) Note also, it is often very difficult to know whether something is discrete or
          continuous, for example, number of sexual relations, number of questions left
          unanswered on an exam.
      iii) Rule of thumb: Ask yourself whether recording the variable as a fraction
          makes sense. If not, it is discrete, if so, it is continuous. So, can I have ½ a
          kid? Can I have ½ a gender? No, this is nonsensical.
3) Nominal, Ordinal and Interval/Ratio variables
   a) Nominal: The categories are not “numerical” and cannot be thought of as
      “higher” or “lower” in a numerical sense. THIS IS AKA CATEGORICAL
      DATA, so that whenever the data is grouped into categories, the data is
      considered to be nominal.
      i) GOOD PROPERTIES OF NOMINAL VARIABLES:
          (1) Mutual exclusivity
              (a) The following example is a nominal variable (MOTHER’S AGE) that
                  is NOT mutually exclusive.
                            Mother's age
                  Category            Frequency
                  13-20 Years               21
                  20-30 Years              305
                  30-35 Years              101
                  35-55 Years               62
                  Total                    489
              (b) Exhaustive: ALL values are considered. In the example above, are the
                  categories exhaustive?
                  (i) Relative homogeneity: cases should be truly comparable
      ii) The median and nominal level data.
   b) ORDINAL VARIABLES: Categories that are ranked
      i) EXAMPLE: In the gss dataset, there is a variable called “spanking” that asks
          whether the respondent favors spanking to discipline a child. The answers
           range from strongly agree to strongly disagree, and these can be considered
           ranked.
           (1) Note: We can make 0 – 10 strongly disagree to strongly agree or 10 – 0
               strongly disagree to strongly agree and it does not matter at all. The
               “numeric” values we assign to the answers are arbitrary and meaningless.
   c) Interval/Ratio: Two properties
       i) Equal distance between values
       ii) 0 is a real value
4) What does Healey mean by “data reduction”?
   a) Data reduction involves using a few numbers to summarize the distribution of a
       variable, or an array of data as he calls it.
   b) What is the problem with using only a few numbers to summarize the distribution
       of a variable?
       i) Summarizing a distribution involves using the mean, denoted $\bar{x}$, or standard
           deviation, denoted $s$ (or $\sigma$ for a population), to describe the variable.
           This inevitably leads to a loss of information (precision and detail).
5) Rates: rates are defined as the number of actual occurrences of some phenomenon
   divided by the number of possible occurrences per some unit of time.
   a) EXAMPLE: In a city of 750,000 people, the frequency of unwed pregnancies in
       a one-year period was 1875. What is the unwed pregnancy rate for this city?
       i) ANS. 1875/750,000 = .0025 = 2.5 per thousand
       ii) What is the pregnancy rate per 10,000 people? 25
   b) EXAMPLE: In a city with population 1,000,000, there were 516 homicides in the
       past year. What is the homicide rate per 10,000 people? 5.16.
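The two rate examples above can be checked with a short Python sketch. The helper name `rate` and its `per` parameter are illustrative assumptions, not something taken from Healey.

```python
def rate(occurrences, population, per=1):
    """Actual occurrences divided by possible occurrences, scaled to `per` people."""
    return occurrences / population * per

# Unwed pregnancies: 1875 in a city of 750,000
unwed_per_1000 = rate(1875, 750_000, per=1_000)
unwed_per_10000 = rate(1875, 750_000, per=10_000)

# Homicides: 516 in a city of 1,000,000
homicide_per_10000 = rate(516, 1_000_000, per=10_000)

print(round(unwed_per_1000, 2), round(unwed_per_10000, 2), round(homicide_per_10000, 2))
```

Changing `per` only rescales the same underlying proportion, which is why 2.5 per thousand and 25 per 10,000 describe the same rate.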
6) Measures of Central Tendency
   a) Measures of central tendency measure the typical value of a distribution.
       i) It is a way to summarize the distribution to give you an idea about the typical
           case of that distribution, in other words, the center of it.
   b) There are three measures of central tendency
       i) The mean: describes the average score
       ii) The mode: describes the most recurring score
       (1) The only measure of central tendency that can be used with nominal
          variables
   iii) The median: the 50th percentile of the distribution
       (1) A median is a special case of a percentile, which is the value below
          which a specific percentage of cases fall.
c) How does the median differ from the mode and the mean? Unlike the mode or the
   mean, the median always represents the exact center of a distribution of scores,
   meaning that 50% of the cases always fall above the median and 50% of the cases
   always fall below the median.
d) Characteristics of the mean
   i) The mean is the balance point of any distribution: it is the point
       around which all of the scores cancel out. Mathematically, this says that if I
       subtract the mean from each value and sum the results, the resulting sum will
       be equal to 0. This is mathematically given as
       $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
       (1) Example: consider 5 numbers 1, 2, 3, 4, 5. The mean is 3. The equation
          says to subtract the mean from each observation and the sum should be 0.
       1 – 3 = -2
       2 – 3 = -1
       3 – 3 = 0
       4 – 3 = 1
       5 – 3 = 2
       (-2) + (-1) + (0) + (1) + (2) = 0 
       More generally,
       $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$
       REMEMBER THIS RESULT FOR THE EXAM, BUT THERE IS NO NEED
       TO BE ABLE TO DERIVE IT.
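As a quick check of this result, here is a minimal Python sketch using the same five numbers. The variable names are illustrative only.

```python
scores = [1, 2, 3, 4, 5]

mean = sum(scores) / len(scores)          # 3.0
deviations = [x - mean for x in scores]   # each score minus the mean
total = sum(deviations)                   # the deviations cancel out

print(mean, deviations, total)
```

With other data sets the sum may come out as a tiny nonzero number because of floating-point rounding, so in practice it is compared to 0 within a small tolerance.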
ii) The mean may often be very misleading because it is sensitive to all
   observations whereas the median is not. In fact, the median is less sensitive
   to extreme observations and therefore it is often “better” to report the median.
   (1) To illustrate this, consider the familiar normal or “bell” curve. This is a
       symmetric distribution because there are as many values on the left as
       there are on the right of the center. Many natural phenomena have normal
       distributions, such as weight, height, etc.
   (2) There are important distributions that are not symmetric. IF THE
       DISTRIBUTION IS NOT SYMMETRIC THEN THE MEDIAN,
       MODE AND MEAN ARE NOT EQUAL. IT IS ONLY IN
       SYMMETRIC DISTRIBUTIONS WHEN THESE THREE
       MEASURES ARE EQUAL. When a distribution is not symmetric, it is
       skewed. There are two types of skewed distributions, right skewed and left
       skewed.
       (a) EXAMPLE of RIGHT SKEWED: Income. Often it is better to
          report the median than the mean, since the mean is misleading in
          extreme cases.
       (b) EXAMPLE. Consider the following summary of AGE. Notice that the
          arithmetic mean is somewhat greater than the median. The reason is
          that the distribution is right skewed. If the mean is larger than the
          median the distribution is __________ skewed.
                                             Statistics
                                   AGE OF RESPONDENT
                                   N       Valid      1385
                                           Missing       2
                                   Mean              44.94
                                   Median            41.00
                      To see this, create a histogram of the age variable.
                      [Figure: histogram of AGE OF RESPONDENT. Std. Dev = 17.08, Mean = 44.9, N = 1385]
                    (c) Calculation of Grouped Mean
         Minutes spent on test                  mid pt       f      mid pt x f
         0 to less than 5 minutes                 2.5        2           5
         at least 5 but less than 10 mins         7.5       12          90
         at least 10, less than 20 mins          15         16         240
         Total                                              30         335
         Grouped mean = 335/30 ≈ 11.2
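The grouped-mean calculation can be sketched in Python as follows. The layout of `groups` (interval bounds paired with a frequency) is an assumption made for illustration.

```python
# Each entry: ((lower bound, upper bound) in minutes, frequency f)
groups = [((0, 5), 2), ((5, 10), 12), ((10, 20), 16)]

total_f = sum(f for _, f in groups)
# Midpoint of each interval times its frequency, summed over all intervals
weighted_sum = sum((low + high) / 2 * f for (low, high), f in groups)
grouped_mean = weighted_sum / total_f

print(total_f, weighted_sum, round(grouped_mean, 1))
```

The result is only an estimate of the true mean, because every case in an interval is treated as if it sat exactly at the midpoint.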
7) The summation operator
   a) $\sum$ is called the summation operator; it is useful when representing the sum
         of a large group of numbers
   b) $\sum_{i=1}^{n} x_i$ means $x_1 + x_2 + \dots + x_n$
   c) $\sum_{i=1}^{n} x_i^2$ means $x_1^2 + x_2^2 + \dots + x_n^2$
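A small Python sketch of the two sums, using the numbers 1 through 5 as an assumed example. It also shows that the sum of squares is not the same thing as the square of the sum, which is an easy mistake to make on an exam.

```python
x = [1, 2, 3, 4, 5]

sum_x = sum(x)                         # x_1 + x_2 + ... + x_n
sum_x_squared = sum(v ** 2 for v in x) # x_1^2 + x_2^2 + ... + x_n^2
square_of_sum = sum_x ** 2             # (x_1 + ... + x_n)^2, a different quantity

print(sum_x, sum_x_squared, square_of_sum)
```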
8) Measures of Dispersion
   a) What is a “measure of dispersion?”
        i) Measures of Central Tendency don’t tell anything about how much the data
               values differ from each other.
               (1) EXAMPLE: What is the mean of the following two distributions of
                    AGE?
                    (a) 50 50 50 50 50
                    (b) 10 20 50 80 90
               (2) The distributions are obviously very different.
                    (a) Measures of dispersion or variability attempt to quantify the spread of
                       observations.
                    (b) It is a measure of variability, usually defined in terms of variability
                       around the mean.
                     (c) The deviation is the distance between an individual score and the
                        mean value; mathematically this is $(X_i - \bar{X})$.
                     (d) The larger the distance from the mean, the larger the deviation will be.
                     (e) The more closely the scores cluster around the mean, the less
                        variability there will be.
                        (i) PRACTICAL EXAMPLE: Let’s assume that average income for
                             people with PhDs is $55,000 and average income for people with
                             a high school education is $20,000. Since people with only a HS
                             education have fewer opportunities than those with PhDs, most
                             people who only have a HS education would make somewhere
                             around $20K, so there is not much variation. However, it is possible
                             for PhDs to make anywhere from $20K to $800K per year, and
                             hence there is much more variation around the average salary for
                             PhDs than there is for HS graduates.
   b) Measures of dispersion we have looked at
        i) Inter-Quartile Range: defined as the 75th percentile minus the 25th percentile.
      ii) Quartile/Deciles
      iii) Standard deviations
      iv) Creating Box and Whiskers
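To see why a measure of dispersion is needed, the two AGE distributions from the example above can be compared in Python. This is a minimal sketch using the standard library's `statistics` module, with the sample standard deviation (dividing by n - 1) as the assumed definition.

```python
import statistics

ages_a = [50, 50, 50, 50, 50]
ages_b = [10, 20, 50, 80, 90]

mean_a, mean_b = statistics.mean(ages_a), statistics.mean(ages_b)
sd_a, sd_b = statistics.stdev(ages_a), statistics.stdev(ages_b)

print(mean_a, mean_b)               # identical means
print(round(sd_a, 2), round(sd_b, 2))  # very different spreads
```

Both distributions have a mean of 50, but only the standard deviation reveals that the second one is far more spread out.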
9) Standardized Variables
   a) EXAMPLES
      i) Here is a random sample of eleven scores on a PLS 201 exam: 12, 16, 16, 18,
          23, 23, 24, 25, 25, 26, 29
          (1) Find the sample mean.
              (a) Answer: $\bar{x}$ = 21.5
          (2) Find the sample standard deviation.
             (a) Answer: you should get something approximating 5.
          (3) Find the median. Answer: 23
           (4) Find the z-score for the student who received the highest score on the
              exam. Answer: $z = (29 - \bar{x})/s_x = (29 - 21.5)/5 = 1.5$, where
              $\bar{x} = 21.5$ and $s_x = 5$.
              Interpretation: this student’s score was 1 ½ standard deviations above the mean.
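The answers to this example can be checked with Python's `statistics` module. This is a sketch, not part of the original exercise; it assumes the sample standard deviation (n - 1 in the denominator) is intended.

```python
import statistics

scores = [12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)      # sample standard deviation
median = statistics.median(scores)

# z-score of the highest score, first with the rounded values used in the
# answer key (21.5 and 5), then with the unrounded mean and sd
z_rounded = (29 - 21.5) / 5
z_exact = (max(scores) - mean) / sd

print(round(mean, 2), round(sd, 2), median, z_rounded, round(z_exact, 2))
```

The unrounded z-score is a little below 1.5 because the standard deviation is slightly larger than 5; the answer key's 1.5 comes from the rounded inputs.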
      ii) Faculty salaries at a Midwestern university are normally distributed with a
          mean of $51,500 and standard deviation of $3,000.
          (1) Find the probability that one faculty member chosen at random has a
             salary less than $50,000.
              (a) Answer: Let X = salary of a randomly selected faculty member.
                  We are given that X ~ N(51500, 3000), normal with mean 51,500 and
                  standard deviation 3,000. P(X < 50000) = P(Z < (50000 -
                  51500)/3000) = P(Z < -.5) = .3085
      iii) The mean height of adults in an African village is 150 cm, and the standard
          deviation is 6 cm. What is the probability that a randomly selected adult from
          this village will be shorter than 162 cm, if we assume that the distribution of
          height in the population is normal?
          (1) Calculate the z-value for 162 cm based on the z-transformation, and look up
             the corresponding p value using the table. The z-transformed value of
             x = 162 cm, with mean = 150 cm and SD = 6 cm, is
             $z = \frac{x - \bar{x}}{s_x} = \frac{162 - 150}{6} = +2$
        (2) The table value corresponding to z = +2 is 0.4772. This is
            the proportion of area under the curve between the mean and the specified
            z-value. We also need to add the proportion below the mean, since we are
            looking for heights under 162 cm (which includes heights under 150 cm as
            well). Therefore, (50% + 47.72%) = 97.72%; that is, the probability that a
            randomly selected adult from this village will be shorter than 162 cm is
            97.72%.
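The table lookup in this example can be reproduced with the error function from Python's `math` module. This is a sketch; the helper name `area_mean_to_z` is an assumption chosen to mirror the "area between the mean and z" column of a standard normal table.

```python
import math

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

z = (162 - 150) / 6            # z-transformation of 162 cm
between = area_mean_to_z(z)    # area between the mean and z = +2
below = 0.5 + between          # add the half of the curve below the mean

print(z, round(between, 4), round(below, 4))
```

Adding 0.5 is only correct here because z is positive; for a negative z the area between the mean and z would be subtracted from 0.5 instead.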
   iv) A random sample of 47 items is drawn from a population with mean 40 and
       standard deviation 1.46.
        (1) Give a range of values that is almost certain to contain the value
            of any particular item drawn.
            (a) Y should be within 3 standard deviations of the mean. That is, between
               35.62 and 44.38.
        (2) What is the probability that Y will be greater than 50?
            (a) z = (50 – 40)/1.46 = 10/1.46 = 6.85, so P(Y > 50) ≈ 0
        (3) What is the probability that Y will be less than 38?
            (a) z = (38 – 40)/1.46 = -1.37, so P(Y < 38) = P(Z < -1.37) = 0.0853
               (i.e. column c in the table)
        (4) What is the probability that Y will be greater than 45?
            (a) z = (45 – 40)/1.46 = 3.42, so P(Y > 45) ≈ 0.0003, which is effectively 0
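These answers can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch assumes, as the exercise implicitly does, that individual items are normally distributed with mean 40 and standard deviation 1.46.

```python
from statistics import NormalDist

Y = NormalDist(mu=40, sigma=1.46)

# Range within 3 standard deviations of the mean
low = Y.mean - 3 * Y.stdev
high = Y.mean + 3 * Y.stdev

p_gt_50 = 1 - Y.cdf(50)   # upper tail beyond 50
p_lt_38 = Y.cdf(38)       # lower tail below 38
p_gt_45 = 1 - Y.cdf(45)   # upper tail beyond 45

print(round(low, 2), round(high, 2))
print(round(p_gt_50, 4), round(p_lt_38, 4), round(p_gt_45, 4))
```

Small differences from the table values can appear because the table rounds z to two decimal places before the lookup.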
b) The school doctor has measured the heights of the pupils in class 5A. The
   results (in cm) are as follows:
                130         132         138         136         131         153
                131         133         129         133         110         132
                129         134         135         132         135         134
                133         132         130         131         134         135
                135         134         136         133         133         130
                             Table 3.2 Heights of the pupils of the class 5A
          i) Box plot method
               (1) Below are the steps to follow in constructing a box plot.
  Steps to follow in constructing a box plot
     1. Calculate the median M, lower and upper quartiles, Q1 and Q3, and the
        interquartile range, IQR= Q3 – Q1, for the data set.
     2. Construct a box with Q1 and Q3 located at the lower corners. The base width will
        then be equal to IQR. Draw a vertical line inside the box to locate the median M.
      3. Construct the limits on the box plot: the inner fences are located a distance of
         1.5 * IQR below Q1 and above Q3. Values beyond these limits are extreme values.
      4. Mark the extreme values on the box plot using asterisks (*).
[Diagram: a box running from Q1 to Q3 (width IQR) with the median M marked inside; the inner fences lie 1.5 * IQR beyond Q1 and Q3, with the outer fences farther out.]
 Answer: your box plot should look like this
       Figure 3.6 Output from SPSS showing box plot for the data above.
          (2) For the following find the
             (a) Median
             (b) Quartile 1
             (c) Quartile 3
             (d) Interquartile range.
          (3) Draw a box and whisker plot, identifying any extreme values.
             (a) Remember to order the data before you begin.
                 (i) 32 30 36 27 24 33 34
                 (ii) 998 92 432 223 785 335 367 444 457 458 488
             (b) Answers
                  (i) Q1 = 27, Q2 = 32, Q3 = 34, IQR = 7. No extremes.
                  (ii) Q1 = 335, Q2 = 444, Q3 = 488, IQR = 153. Extremes: 785 and 998
                      (above 717.5) and 92 (below 105.5).
Detailed Solutions
   c) Order the data: 24 27 30 32 33 34 36
      i) From here, it is easy to see that 32 is the median
      ii) To find Q1, (.25)(7) = 1.75, rounding up gives 2, so Q1 is the 2nd value, 27
      iii) To find Q3, (.75)(7) = 5.25, rounding up gives 6, so Q3 is the 6th value, 34
      iv) IQR = 34 – 27 = 7
   v) Extremes: Q3 + 1.5(IQR) = 34 + 1.5(7) = 44.5, there are no values greater than or
       equal to that in the data, so there are no upper extremes. Also, Q1 – 1.5(IQR) =
       27 – 1.5(7) = 16.5, there are no values less than or equal to this, so there are
       no lower extremes.
d) Follow the same procedure for (ii).
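The procedure in the detailed solution can be sketched in Python for both data sets. The function names are illustrative assumptions. The sketch uses the rounding-up rule shown above (position = p times n, rounded up), which is only one of several quartile conventions, so software such as SPSS may report slightly different quartiles. How to handle a position that is already a whole number is not covered in the solution, so the sketch is only meant for cases like these two, where rounding up is needed.

```python
import math

def value_at(sorted_values, p):
    """Value at position ceil(p * n), counting from 1 (the rounding-up rule)."""
    position = math.ceil(p * len(sorted_values))
    return sorted_values[position - 1]

def box_plot_summary(values):
    data = sorted(values)                 # always order the data first
    q1 = value_at(data, 0.25)
    q2 = value_at(data, 0.50)             # the median, for these odd-sized data sets
    q3 = value_at(data, 0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Extremes: values at or beyond the limits, as in the detailed solution
    extremes = [v for v in data if v <= lower_limit or v >= upper_limit]
    return q1, q2, q3, iqr, extremes

summary_i = box_plot_summary([32, 30, 36, 27, 24, 33, 34])
summary_ii = box_plot_summary([998, 92, 432, 223, 785, 335, 367, 444, 457, 458, 488])

print(summary_i)
print(summary_ii)
```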
Practice Multiple Choice Questions
    1. The average time between infection with the AIDS virus and developing AIDS
       has been estimated to be 8 years with a standard deviation of about 2 years.
       Approximately what fraction of people develop AIDS within 4 years of infection? b
           a. about 5 %
           b. about 2.5%
           c. about 32%
           d. about 16%
           e. about 1%
    2. An instructor decides to "curve grades" in a course depending upon the percentile
       measures. Here are some summary statistics: b
Quantile Levels   Minimum      10.0%     25.0%    Median     75.0%     90.0%     Maximum
 Final Mark          10          48        55      66          78        87         93
       Which of the following is FALSE?
           a. About 1/4 of the class received a score of 55 or less.
           b. About 3/4 of the class received a score of 75% or less.
           c. About 50% of the class received grades between 55 and 78.
           d. This method assigns grades relative to how others do in a class rather than
              against an absolute standard.
           e. This method always has half of the class at or above the median grade.
    3. An experiment was performed upon rats to investigate the effect of ingesting Alar
       (a chemical sprayed on apple trees to keep fruit from dropping before ripe) upon
       subsequent cancer rates. The following variables were measured:
       gender (0=female, 1=male); weight (g); dose of Alar (nil, low, high); number of
       tumors
       The typical weight of a rat is about 800 g and the weights were rounded to the
       nearest gram. The number of tumors is around 10. Which of the following is
       FALSE? c
          a. Gender is nominal scale; dose is ordinal scale
          b. Gender is discrete; weight is continuous
          c. Number of tumors is discrete and is interval scale
          d. Dose is ordinal scale and discrete
          e. Weight is ratio scale; and number of tumors is discrete.
   4. Here are some summary statistics on the results of the experiment. Draw suitable
      BOXPLOTS to compare the results. Salmon production is in kg/km of spawning
      sites.
Quantiles
Level     Minimum 10.0%             25.0%     median 75.0%         90.0%    maximum
clear cut 0.9      0.9              3.3       19.2   48.1          87.4     90.0
selective 0.9     1.2               8.5       29.3   51.5          93.4     108.0
Means and Std Deviations
Level     Number       Mean      Std Dev    Std Err Mean
clear cut 12           29.7      30.4       8.8
selective 12            34.6      31.1       9.0
Solution: In this case, side-by-side box plots would be suitable
   5. What do you conclude from your boxplot and the descriptive statistics? Be sure to
      explain how your plot leads you to this conclusion.
   Solution: It appears that clear cut streams produce, on average, less salmon than
   selectively cut streams. This is because the box plot for the clear-cut areas is shifted
   down relative to the box-plot for the selective harvest areas; and the median of the
   clear cut areas appears to be less than the median of the selective harvest areas.
   6. Which one of the following statements is FALSE? a
        a. Pie charts are better than bar graphs for comparing relative sizes.
       b. Data that are nominal scale are presented using frequency tables.
       c. Means and standard deviation of ordinal data are meaningless.
       d. Box-plots are a good choice for comparing the distribution of values
          among groups.
7. As part of a study to investigate the effects of stubble burning, the following
   variables were measured at several sites around Winnipeg:
        pH of soil (to one decimal place, e.g., 6.3; note that a pH of 0 is not a meaningful zero);
        crop grown (0=wheat, 1=barley, 2=oats, 3=other);
        amount of stubble (0=light, 1=medium, 2=heavy);
        date of final harvesting (e.g., 10 Oct 92)
   The scales of these variables are:
       a.   interval, ordinal, ratio, ratio
       b.   interval, nominal, nominal, interval
       c.   interval, nominal, ordinal, interval
       d.   ratio, ordinal, ordinal, ratio
       e.   interval, nominal, ordinal, ratio
8. A student discovers that his grade on a recent test was the 72nd percentile. If 90
   students wrote the test, then approximately how many students received a higher
   grade than he did? b
       a. 65
       b. 25
       c. 72
       d. 71
       e. 18
Solution: (1 - .72)(90) = 25.2. Equivalently, (.72)(90) = 64.8 students scored lower than
he did, so 90 – 64.8 is approximately 25.
9. Many professional schools require applicants to take a standardized test. Suppose
   that 1000 students write the test, and you find that your mark of 63 (out of 100)
   was the 73rd percentile. This means: c
       a. At least 73% of the people got 63 or better.
       b. At least 270 people got 73 or better.
       c. At least 270 people got 63 or better.
       d. At least 27% of the people got 73 or worse.
       e. At least 730 people got 73 or better.
Solution: We know that 73% of the people scored below 63 and 27% of the people
scored better than 63, so rule out a and d. This means that (.73)(1000) = 730 scored below
63 and (.27)(1000) = 270 scored better than 63. C must be the correct answer.
Last Question
1994: DIVORCES PER 1,000 POPULATION
Valid              Frequency           Percent         Cumulative
                                                       Percent
5.0                2                   14.3            14.3
5.1                2                   14.3            28.6
5.2                1                   7.1             35.7
5.3                1                   7.1             42.9
5.4                1                   7.1             50.0
5.5                1                   7.1             57.1
5.6                1                   7.1             64.3
5.7                1                   7.1             71.4
5.8                2                   14.3            85.7
5.9                1                   7.1             92.9
6.0                1                   7.1             100.0
                 Total 14      100.0
Identify the Percent, Cumulative Percent and state the median, Q1, Q3, Interquartile
Range and identify any extreme values.
Q1 = 5.1 and Q3 = 5.8, so IQR = 5.8 – 5.1 = .7
Q3 + 1.5(.7) = 5.8 + 1.5(.7) = 6.85, so there are no upper extreme values
Q1 – 1.5(.7) = 5.1 – 1.5(.7) = 4.05, so there are no lower extreme values
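The Percent and Cumulative Percent columns of the divorce-rate table can be rebuilt from the frequencies alone. This is a minimal Python sketch; the dictionary `freq` simply restates the Valid and Frequency columns.

```python
# Divorces per 1,000 population -> frequency (number of cases)
freq = {5.0: 2, 5.1: 2, 5.2: 1, 5.3: 1, 5.4: 1, 5.5: 1,
        5.6: 1, 5.7: 1, 5.8: 2, 5.9: 1, 6.0: 1}

total = sum(freq.values())
running = 0
rows = []
for value in sorted(freq):
    running += freq[value]                       # cases at or below this value
    percent = round(100 * freq[value] / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((value, freq[value], percent, cumulative))

for row in rows:
    print(row)
```

The cumulative percent column is what makes the quartiles easy to read off: Q1 is the first value whose cumulative percent reaches 25 (here 5.1) and Q3 the first to reach 75 (here 5.8).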