Intro to Biostatistics Course
Biostatistics
                                www.harding.edu/plummer/biostats/biostats.pdf
Spring 2018
Course Description
     An introductory computer-based statistics course that includes instruction in SYSTAT. Topics
covered include populations and samples, variables, probability distributions, descriptive statistics,
statistical inference, and hypothesis testing. Included are selected parametric and non-parametric tests
for examining differences in means, variances, and frequencies as well as correlation, regression, and
tests of independence.
     Emphasis is given to practical matters such as how to choose appropriate analyses and how to
interpret results, both statistically and biologically. High school algebra is the only math background
you need. Biostats is a practical application course - to learn it, you have to do it. Failing to apply
statistical concepts and procedures on a regular basis will diminish your chances of understanding the
material and earning the grade you desire.
Student Learning Outcomes – By the end of the semester you will be able to:
 - understand how science and statistics interact
 - apply basic statistical procedures using professional statistical software
 - read and understand primary biological literature
Evaluation
 Component    Weight   Notes
 Exam 1       20%      Exams 1-3 are comprehensive and consist of Content (scantron/short
 Exam 2       20%      answer, 50%) and Practical (SYSTAT problems/graphing, 50%) sections.
 Exam 3       20%      An extra point may be earned on each exam if you are present in class
                       when feedback is given on your graded exams. Exam study guides
 Quizzes      20%      ~10 announced quizzes and exercises
 Final Exam   20%      The final exam is a comprehensive scantron exam taken during the
                       regularly scheduled final exam period. Unlike Exams 1-3, you will not
                       use a computer on the final for any task; this includes SYSTAT.
                       Exam study guides
Classroom Policies
  - Computer resources that may be viewed during lecture include the course website, SYSTAT, and
     your M-drive. All other uses (e.g., social notworking sites such as Facebook, Twitter, Instagram,
     email, blogs, sports news, pictures of your girl/boyfriend, etc.) are off limits during lecture.
  - Cell phone use during lecture is prohibited. If you must send or receive a text or call during
     lecture time, please excuse yourself from the classroom and take it to the hallway.
  - Regular class attendance is necessary to do well in this course. Excessive unexcused absences will
     be handled on an individual basis. An official HU class excuse or prior arrangements with the
     instructor are necessary to be excused from an exam.
  - Cheating in all its forms is inconsistent with Christian faith and practice and will result in
     sanctions up to and including dismissal from the class with a failing grade. Instances of
     dishonesty will be handled according to the procedures delineated in the Harding University
     catalog.
  - The visible presence or use of any unapproved electronic device during an exam will be
     interpreted as cheating and will result in a zero for that exam.
  - In accordance with the official Time Management Policy of the University, you are expected to
     spend two hours outside of class for each credit hour spent in class each week. That amounts to six
     additional hours per week, two of which are imposed on you in conjunction with regular class
     time.
  - THE ONLINE BIOSTATS LECTURE NOTES ARE NOT COMPLETE SOURCES OF INFORMATION
     FOR EXAMS. In general, students are responsible for anything discussed in class.
 My Responsibilities
 Because, as your teacher, I have a substantial responsibility to you and to the Lord (James 3:1), I
 promise my best effort to you in Biol. 254. I pray that my lectures will be clear, my expectations
 reasonable, and my exams vigorous, thorough, challenging, and fair. I also pray that your grade will
 reflect both your ability and your preparation. Finally, I hope that you will learn something
 substantive in my class regardless of what you think about the subject matter. For further insight into
 my teaching philosophy, click here - Good luck!
 Misc.
  - You will need a personal Dropbox account. Data files for the course are available in a shared Dropbox
    folder called “Student Biostats.” You should download these files to your M-drive.
  - Statements on academic dishonesty, teaching evolution, and students with disabilities
                             Introduction to Statistics
-e.g., “There are three kinds of lies: lies, damn lies, and statistics.” -B. Disraeli
                                  Descriptive Statistics
Statistical Basics
A. Definitions
    1. variable - characteristics that may differ (vary) among individuals
        a. measured
        b. derived (non-measured); derived from measured variables
        c. dependent/response variable vs. independent/predictor variable
    2. data - values of variables for individuals (singular datum)
    3. case/observation - an individual; symbolize: x1, x2, ...xn (n=sample size)
B. Collection of data
   1. population - all individuals of a defined universe (= whatever we say it is!)
   2. sample - subset of population; used to make inferences regarding the population
   3. statistical error - difference between the real population value and the estimates (from
      sample data) of the population value
   4. randomness - all individuals have equal probability of being sampled
   5. independence - value of one case does not affect the value of other cases
Introduction to SYSTAT
 Prepare a SYSTAT data file using the data below. These data are measurements taken from
 10 specimens of spiny guanotzits from Arkansas and Missouri. The variables are: collection
 locality (categorical), length of body (continuous), sex (categorical), weight of body
 (continuous), amount of pigment on the lower jaw (ranked), and number of scales on the chin
 (discrete).
     Case          1     2     3     4     5     6     7     8     9     10
     Locality      AR    AR    MO    MO    MO    AR    AR    MO    AR    MO
     Length (mm)   22.5  21.4  20.8  20.6  19.8  20.1  22.3  21.7  20.4  21.1
     Sex           m     m     f     f     f     f     m     f     m     f
     Weight (g)    333   298   401   257   21    30    478   400   35    288
     Pigment       4     5     5     3     2     1     1     5     4     5
     No. scales    23    22    14    26    9     21    17    12    15    12
 Name your data file first.syz (the file extension .syz identifies a SYSTAT data file). After you
 finish entering the data, proofread the file to make sure that the data are correct, edit if
 necessary, save the file and close it. Reopen the file and use it to learn the following menus
 and functions:
  - File Menu (New, Open, Save, Save As, Print, Exit)
  - Edit Menu (Undo, Cut, Copy, Paste, Copy Graph, Delete, Options)
  - Data Menu (Variable properties, Transform [Let and If - Then Let], By Groups, Select Cases)
  - Graph Menu (Histogram)
  - Analyze Menu (One-Way Frequency Tables, Basic Statistics, Tables)
Exercises
1. calculate the average guanotzit weight (254.1 g)
                        Parameter                          Statistic
    Mean                $\mu = \sum x / n$                 $\bar{x} = \sum x / n$ (= “x-bar”)
    Variance            $\sigma^2 = \sum (x-\mu)^2 / n$    $s^2 = \sum (x-\bar{x})^2 / (n-1)$
    Standard Deviation  $\sigma = \sqrt{\sigma^2}$         $s = \sqrt{s^2}$ (= “SD”)
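
The parameter and statistic formulas above can be checked with a short script. This is only a sketch in Python (not part of the course software, which is SYSTAT); it uses the guanotzit body weights from the Introduction to SYSTAT data table, and the variable names are illustrative.

```python
from statistics import mean, pvariance, stdev, variance

# Body weights (g) of the 10 guanotzit specimens from the data table
weights = [333, 298, 401, 257, 21, 30, 478, 400, 35, 288]

x_bar = mean(weights)        # sum(x) / n
s2 = variance(weights)       # sample variance: sum((x - x_bar)^2) / (n - 1)
sigma2 = pvariance(weights)  # population formula: sum((x - mu)^2) / n
s = stdev(weights)           # sample SD: sqrt(s2)

print(f"mean = {x_bar:.1f} g")
print(f"s^2 = {s2:.1f}, SD = {s:.1f}")
print(f"variance with the population formula = {sigma2:.1f}")
```

The sample variance divides by n-1 and is therefore always a little larger than the value from the population formula for the same data.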
          QUESTION: What are the descriptive statistics of snout-vent-length for female salamanders
                collected in Arkansas?
   You now have sufficient knowledge to begin the Graph Construction Exercise on p15.
   Probability Basics
   Example: 1 coin toss- possibilities: 1H, 1T
      a. probabilities: no. ways an event (H or T) can
           occur /total no events (2) possible; “division”
           rule; 1H [1/2] = 0.5; 1T [1/2] = 0.5
      b. add all possibilities = 1 [0.5 + 0.5 = 1]
      c. probability distribution shape
Binomial Distribution
   1. formula: $P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$
      -no need to memorize the formula but you must be able to recognize the formula and
        each of its terms
   2. terms
      P = probability of the number of
            occurrences of the event of interest
      p = probability of event of interest =
            head (“success”)
      q = probability of other event (1-p) =
            not head (“failure”)
      n = number of “simultaneous” events (trials)
      x = number of occurrences of the event of interest
 Based on the theory of sex determination in mammals (equal chance of being male or female),
 calculate the expected frequencies for the number of males in these litters.
Importance of sample size for observed data (1 coin example, compare to theoretical)
  IF observed = norm coin, THEN the larger the n, the closer we approximate expected
    conversely, THEN the smaller the n, the more we deviate from expected
 Data are number of male hatchlings emerging from 84 nests of kaw turtles (kaw turtles
 always lay 6 eggs per nest).
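
As a rough illustration of the binomial formula, the sketch below computes expected frequencies for the number of males per nest. It takes n = 6 eggs and 84 nests from the text; p = 0.5 is an assumption borrowed from the equal-chance sex determination model, and the function name is hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = n!/(x!(n-x)!) * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n_eggs = 6     # eggs per nest (from the text)
n_nests = 84   # nests sampled (from the text)
p_male = 0.5   # ASSUMPTION: equal chance of a male or female hatchling

# Expected number of nests containing x males, for x = 0..6
expected = {x: n_nests * binomial_pmf(x, n_eggs, p_male)
            for x in range(n_eggs + 1)}

for x, freq in expected.items():
    print(f"{x} males: expected in {freq:.2f} nests")
```

The expected frequencies sum to the number of nests, which is a quick check that all possibilities were included.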
Discrete probability distribution - Poisson (expected distribution for rare and random events)
     1. Poisson: $\mu = \sigma^2$ ($\sigma^2/\mu = 1$) - distribution defined by mean only; low mean value (rare
        events; e.g., recapture rates, bacterial viruses infecting bacteria)
     2. Poisson formula: $P(x) = \bar{x}^{x} e^{-\bar{x}} / x!$
        -Students: no need to memorize the formula but you must be able to recognize the
          formula and each of its terms
     3. terms
        - P = probability of the number of occurrences of the event of interest
        - $\bar{x}$ = mean occurrence of event of interest
        - e = mathematical constant (=2.71828)
        - x = number of occurrences of the event of interest
     6. Poisson shape determined by $\bar{x}$
Example: An ecologist counted the number of maple seedlings in 100 quadrats
                                                          Expected      Expected
                                                          proportion    number (frequency)
     prop (0 seedlings) = $(1.41^{0} e^{-1.41})/0!$ =     0.244         x 100 = 24.41
     prop (1 seedling)  = $(1.41^{1} e^{-1.41})/1!$ =     0.344         x 100 = 34.42
     -etc.
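
The remaining rows of the seedling example follow the same pattern. A minimal sketch, assuming the mean of 1.41 seedlings per quadrat and the 100 quadrats used in the example (the function name is illustrative):

```python
from math import exp, factorial

def poisson_pmf(x, mean):
    """P(x) = mean^x * e^(-mean) / x!"""
    return mean**x * exp(-mean) / factorial(x)

mean_seedlings = 1.41  # mean seedlings per quadrat (from the example)
n_quadrats = 100       # quadrats counted (from the example)

for x in range(6):
    prop = poisson_pmf(x, mean_seedlings)
    print(f"{x} seedlings: proportion {prop:.3f}, "
          f"expected number {prop * n_quadrats:.2f}")
```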
 2           22
 3           3                       $s^2$ =                   (ans: 0.611)
 4           1
 Total       200         200        $s^2/\bar{x}$ =           (ans: 1.002)
          Do these data support the hypothesis that the chance of being killed by a horse in the
          Prussian Army Corps is a rare and random event? Support your conclusion.
 1. draw 100 dots on the 10x10 grid on the next page (keep your eyes open, try to place dots
    randomly)
 2. count the number of cells with different numbers of dots
 3. create a frequency table of your data
 4. calculate the mean and variance of the number of dots per cell
mean = variance =
6. interpret: ratio = 1 (random); ratio <1 (evenly spaced); ratio >1 (clumped)
 7. Application: patterns of distribution in space reflect biological processes; for example, disease
    spread and behavioral/ecological interactions
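
The mean, variance, and ratio for the dot exercise can be computed from a frequency table as sketched below. The counts here are hypothetical placeholders (100 cells, 100 dots), not real results; substitute your own table.

```python
from statistics import mean, variance

# HYPOTHETICAL frequency table: dots per cell -> number of cells
freq = {0: 36, 1: 38, 2: 18, 3: 6, 4: 2}

# Expand the table into one value per cell
counts = [dots for dots, cells in freq.items() for _ in range(cells)]

x_bar = mean(counts)
s2 = variance(counts)   # sample variance, n - 1 in the denominator
ratio = s2 / x_bar

print(f"mean = {x_bar:.2f}, variance = {s2:.3f}, ratio = {ratio:.3f}")
```

A ratio close to 1 is what a random (Poisson) pattern would produce; how close is close enough is a judgment this sketch does not make.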
                                         Graph Construction
   In this exercise, you will learn to construct five basic graphs used by biologists. The rules for
graph construction presented here will apply to all graphs you construct during the semester. As
you finish graphs, copy and paste each image to a Word file named graphexercise, add the caption,
and save. There are three parts to the exercise:
   1. You will reproduce 5 finished graphs given to you;
   2. You will be given data and asked to construct 5 appropriate graphs;
   3. You will find an example of each of the 5 graph types in the primary literature.
B. Graph reproduction - Reproduce each graph (1-5) illustrated below. Read the description of
   each data file before beginning. Copy and paste your SYSTAT output into a Word file named
   graphexercise, add captions, and save.
                   [Graph 1: Number of Captures (y-axis) by Location (m) (x-axis, 1300-2000)]
                   Fig. 1. The distribution of captures of green snakes according to location.
2. BAR - A SYSTAT Bar graph plots the mean of one variable against another variable.
   Duplicate the BAR graph below. Note bar fill, axis titles, error bars, data plotted, etc. The
   data are in MOUSEDIET.SYZ (Cooper).
                   [Graph 2: BODY MASS (g) (y-axis) by DIET (x-axis: 5K-96, AIN-cas, AIN-spi, P5001)]
                   Fig. 2. The relationship of mean body mass and diet in laboratory mice fed
                   different diets. Plotted are mean ± 1 SD. Sample sizes are: 5K-96, n=34; AIN-cas,
                   n=35; AIN-spi, n=32; P5001, n=42.
3. DOT - A SYSTAT Dot graph plots the mean of one variable against a discrete or categorical
   variable. Duplicate the Dot graph below. Note symbols, error bars, fill, axis titles, axis
   ranges, data plotted, etc. The data are in WORMSURVIVE.SYZ (JMGoy).
           [Graph 3: dot graph; x-axis TRIAL DAY (0-7)]
           Fig. 3. Mean number of C. elegans exhibiting unimpaired movement according to trial
           day. Plotted are mean ± 1 SD. Sample sizes are day 1, n=48; day 2, n=51; day 3,
           n=49; day 4, n=15; day 5, n=7; day 6, n=2.
4. BOX – A SYSTAT Box Plot plots the quartiles of one variable against a discrete or
   categorical variable. Duplicate the Box Plot below. Note symbols, axis titles, axis ranges,
   selected data plotted, etc. The data are in CAVESALYS.SYZ (Sanders).
           Fig. 4. Box plot of the body lengths of female Eurycea lucifuga captured in Arkansas
           and Kentucky caves in February and March. Plotted are the median (horizontal line),
           the 25th and 75th percentiles (box) and the maximum and minimum values (whiskers).
   5. SCATTERPLOT - A SYSTAT Scatterplot plots individual cases of one variable against
      another variable. Duplicate the scatterplot below. Note symbols, axis titles, axis ranges,
      selected data plotted, etc. The data are in LONOKE.SYZ (Plummer).
               [Graph 5: BODY WEIGHT (g) (y-axis, 0-800) by SNOUT-VENT LENGTH (cm) (x-axis, 50-100)]
               Fig. 5. The relationship of body weight and snout-vent length in 99 adult (=individuals
               >50 cm SVL) male diamondback water snakes.
C. Graph construction: Construct an appropriate graph for each of the following problems and
   save in your graphexercise file.
   6. Use the following data on bill lengths (mm) of 42 belted kingfishers to construct a graph (Fig.
      6) that plots the median and other quartiles separately for males, females, and the sexes
      combined (3 groups).
               males: 48.1, 47.7, 48.0, 50.6, 50.8, 49.9, 49.3, 50.8, 46.9, 49.9, 48.8,
                      47.5, 48.2, 51.0, 48.8, 52.0, 51.8, 51.0, 50.1, 47.7, 49.9
               females: 53.8, 59.2, 52.3, 59.3, 56.5, 56.2, 55.6, 57.7, 52.5, 47.8,
                      51.5, 55.8, 57.5, 56.8, 47.0, 50.4, 58.0, 61.2, 56.5, 59.3, 59.2
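
Before building the graph, it can help to see the numbers a box plot summarizes. The sketch below computes the quartiles for the three groups in Python; note that quartile conventions differ among programs, so SYSTAT's values may differ slightly from these.

```python
from statistics import median, quantiles

males = [48.1, 47.7, 48.0, 50.6, 50.8, 49.9, 49.3, 50.8, 46.9, 49.9, 48.8,
         47.5, 48.2, 51.0, 48.8, 52.0, 51.8, 51.0, 50.1, 47.7, 49.9]
females = [53.8, 59.2, 52.3, 59.3, 56.5, 56.2, 55.6, 57.7, 52.5, 47.8,
           51.5, 55.8, 57.5, 56.8, 47.0, 50.4, 58.0, 61.2, 56.5, 59.3, 59.2]

groups = {"males": males, "females": females, "combined": males + females}

for name, bills in groups.items():
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3
    q1, q2, q3 = quantiles(bills, n=4)
    print(f"{name:9s} n={len(bills):2d}  min={min(bills)}  Q1={q1:.2f}  "
          f"median={q2:.2f}  Q3={q3:.2f}  max={max(bills)}")
```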
D. Literature Graphs: The third part of this exercise consists of finding an example of each of the
   five graph types in primary literature papers.
   As you locate an example of each graph in the literature, download a digital copy, insert into
   graphexercise, and save in order - Fig. 11 Histogram, Fig. 12 Bar, Fig. 13 Dot, Fig. 14 Box,
   and Fig. 15 Scatterplot. Make sure to include the caption. Under each graph caption, type the
   citation of the paper where you found the graph. Proper citation format is: last name, initials,
   initials last name, and initials last name. year. title. journal volume:pages. Here’s an
   example:
   Harless, M.L., A.D. Walde, and D.K. Delaney. 2010. Sampling considerations for improving
      home range estimates of desert tortoises: effects of estimator, sampling regime, and sex.
      Herpetological Conservation and Biology 5:374-387.
   Note: Histogram, Bar, Dot, Box, and Scatterplot are names given to particular graphs by SYSTAT.
   You may find different names in other statistical software and in the literature; for example, a
   histogram may be called a frequency distribution or a bar graph. Don’t let that confuse you! You
   should be skilled enough to quickly determine the type of graph just by looking and applying your
   knowledge. For example, ask yourself what statistic is plotted on the graph; is it frequencies, means,
   medians, or individual cases?
   Turn in a printed copy of graphexercise on the due date. Print two graphs per page. Do
   not separate the graphs from their respective captions.
                           Inferential Statistics
A. Data that are influenced by many small and unrelated random effects are approximately
   normally distributed (math: Fuzzy Central Limit Theorem); extremely widespread and
   common in nature
B. Forms the conceptual basis of a large number of statistical procedures - one of the most
   important theoretical distributions in statistics
C. Properties
   1. formula: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
   2. students – no need to memorize the formula but you must
      be able to recognize it
   3. shape determined by mean and SD
   4. symmetrical around the mean (mean=mode=median)
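
To see how the mean and SD alone determine the curve, the normal formula can be written out directly. This is a sketch with hypothetical values (mean 100, SD 15); Python's built-in `NormalDist` is used only as a cross-check.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return 1 / (sigma * sqrt(2 * pi)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

mu, sigma = 100, 15  # HYPOTHETICAL mean and SD, for illustration only

for x in (70, 85, 100, 115, 130):
    print(f"x = {x:3d}: f(x) = {normal_pdf(x, mu, sigma):.5f}")

# Cross-check against the standard library implementation
print(abs(normal_pdf(110, mu, sigma) - NormalDist(mu, sigma).pdf(110)) < 1e-12)
```

The printed heights are equal at equal distances above and below the mean, which is the symmetry property listed above.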
Exercise: practice SYSTAT Probability Plot and One-sample KS Test using the variable
H2OOUT from file DLWMEANS.SYZ. Note that H2OOUT is not normally distributed
(it is skewed).
1. Data transformations - many procedures in statistics assume that data are normally
   distributed. If data are not normally distributed, one can transform the data to another
   measurement scale in an effort to normalize them. Deciding which transformation to use is
   entirely practical, i.e., the “right” transformation is whatever makes the data normally
   distributed. Trial-and-error applications of various transformations may be necessary to
   determine which will work. However, some transformations work better in some situations
   than in others. Examples of transformations commonly used in biology are the logarithmic,
   arcsine, and square-root transformations.
     - the logarithmic transformation is useful in a wide variety of situations and is by far the
       most commonly used transformation in biology
     - the arcsine (inverse sine) transformation is used specifically when data are in the form of
       proportions or percentages
     - the square-root transformation is used specifically when data are in the form of counts
2. Transform the variable H2OOUT with common logarithms and retest for normality with both
   Probability Plot and KS. Note that the SYSTAT designation for common logs is L10
   (always use common logs in Biol. 254). After transformation, the new variable
   L10H2OOUT should now be normal.
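
The effect of a common-log transformation on a right-skewed variable can be sketched as follows. The data are hypothetical stand-ins (the DLWMEANS.SYZ values are not reproduced here), and skewness is used only as a rough indicator; it is not a substitute for the Probability Plot and KS test.

```python
from math import log10
from statistics import mean

def skewness(values):
    """Moment-based skewness: 0 for a symmetric sample, > 0 for a right skew."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# HYPOTHETICAL right-skewed measurements standing in for H2OOUT
h2o_out = [1, 2, 3, 5, 8, 13, 40, 100]

# Common (base-10) logarithm, the transformation SYSTAT calls L10
l10_h2o_out = [log10(v) for v in h2o_out]

print(f"skewness before: {skewness(h2o_out):.2f}")
print(f"skewness after:  {skewness(l10_h2o_out):.2f}")
```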
   Population (numerical properties = parameters)  <- “Error” ->  Sample (numerical properties = statistics)
   - The population is what we want to understand.
   - The sample is what we use to understand the population.
       Descriptive stats: describe data in the sample
       Inferential stats: infer from sample to population
    Estimation of parameters
   1. How well does the sample mean (x̄) estimate the population mean (µ)?
      a. in a normally distributed population, 95% of the cases lie between x̄ - 1.96 SD and
         x̄ + 1.96 SD
      b. in a normal sampling distribution, 95% of the means lie between x̄ - 1.96 SE and
         x̄ + 1.96 SE
      c. interpretation: 95% chance that population mean is enclosed within these limits (95%
         confidence limits)
       d. problem: sampling distributions of means may depart from normality if sample size is
          small (central limit theorem)
       e. solution: use distribution that adjusts for sample size - Student’s t-distribution (shape
          determined by 3 characteristics: mean, SD, df)
          - areas of curve that exclude a given proportion of the distribution vary with n (Tables)
          - at infinite df, t0.05 = 1.96 as in normal distribution
   f. to calculate 95% CLs using a t-distribution, replace 1.96 with value from t-table
       - UL: mean + (t[0.05, n-1]) x SE
       - LL: mean - (t[0.05, n-1]) x SE
2. 95% CL in the public media: GPS accuracy, political polls, church surveys
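
The confidence-limit formulas in item f can be sketched numerically. The example below reuses the guanotzit body weights from the Introduction to SYSTAT table (n = 10) and assumes SE = SD/√n; the critical value 2.262 is the standard two-tailed tabled value for t[0.05, 9].

```python
from math import sqrt
from statistics import mean, stdev

# Body weights (g) of the 10 guanotzit specimens
weights = [333, 298, 401, 257, 21, 30, 478, 400, 35, 288]

T_TAB = 2.262  # t[0.05, df = 9] from a two-tailed t-table

n = len(weights)
x_bar = mean(weights)
se = stdev(weights) / sqrt(n)   # standard error of the mean

lower = x_bar - T_TAB * se      # LL: mean - t x SE
upper = x_bar + T_TAB * se      # UL: mean + t x SE

print(f"mean = {x_bar:.1f}, SE = {se:.2f}")
print(f"95% CL: {lower:.1f} to {upper:.1f}")
```

With only 10 cases the tabled t (2.262) is noticeably larger than 1.96, so the limits are wider than the normal approximation would give.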
Differences in means: graphic methods for ‘informed guessing’ whether means are
statistically different
To properly interpret graphs displaying descriptive statistics, you must know what the
error bars represent! (info found in the figure caption or in the M&M)
___________________________________
   3. Example 2: compare sample mean with known population (population: µ = 568; sample: x̄ = 598)
          - SYSTAT: Analyze->Hypothesis Testing->Mean->one-sample t-test
       c.    H0: x̄ = µ; HA: x̄ ≠ µ
       d.    calculate (SYSTAT); one-sample t-test; test statistic, tcalc = 2.31
       e.    determine probability by comparing tcalc to ttab (tabled value; df=29; Tables); P is
             between 0.02 and 0.05
       f.    at P=0.05 (alpha level), ttab = 2.045 (critical value)
       g.    tcalc (2.31) is greater than ttab (2.045), therefore P<0.05
       h.    two explanations for obtaining a high t value (2.31)
             - null hypothesis is true; sample mean differed by chance alone (unlikely)
             - null hypothesis is false (more likely)
       i.   1-sample t-test: rarely done in science… Why?
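
The arithmetic behind steps d through g can be sketched as below. The means, df, and tabled value come from the example; the sample SD is not given in the notes, so the value used here is a hypothetical one chosen to be consistent with tcalc = 2.31.

```python
from math import sqrt

def one_sample_t(x_bar, mu, sd, n):
    """Return (t, df) for a one-sample t-test: t = (x_bar - mu) / (sd / sqrt(n))."""
    se = sd / sqrt(n)
    return (x_bar - mu) / se, n - 1

SD_ASSUMED = 71.13  # HYPOTHETICAL sample SD, not stated in the example
T_TAB = 2.045       # tabled critical value at alpha = 0.05, df = 29

t_calc, df = one_sample_t(x_bar=598, mu=568, sd=SD_ASSUMED, n=30)
decision = "reject H0" if abs(t_calc) > T_TAB else "cannot reject H0"

print(f"tcalc = {t_calc:.2f}, df = {df}, ttab = {T_TAB}: {decision}")
```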
C. Writing null hypotheses for parametric difference tests and their nonparametric counterparts
   (does not include tests of frequencies or tests of relationships): required components
   1. indicator (H0)
   2. parameter (e.g., µ, σ²)
   3. variable (e.g., length, mass)
   4. group (e.g., sex, color); for questions of differences between independent data only
         (no grouping variable for dependent data)
   5. relational operator (e.g., =, ≥, ≤)
          - type I error (rejecting a true null hypothesis); fixed value set by scientific community
            (P=0.05); make mistake 1 out of 20 times
          - type II error (failure to reject a false null hypothesis); can be minimized by:
            1. increasing sample size
            2. choosing the most powerful test (power = probability of rejecting a false null
                hypothesis); minimum power of 80% generally necessary for an acceptable
                biological conclusion when you cannot reject the null hypothesis
          - Why not reduce probability of type I error? – increases probability of type II error
          - Alpha set at 0.05 because it represents a compromise between making type I and type
            II errors
          - SYSTAT - how to calculate power or to determine minimum sample size needed for
            a specific power level (Utilities->Power Analysis->specific test)
       c.   modern experimental design was developed by Ronald Fisher (1930s). “…it should
            be noted that the null hypothesis is never proved or established, but is possibly
            disproved in the course of experimentation.”
IV. Statistical Software (usually found toward the end of M&M in primary literature
papers)
   - SAS (no. 1 statistical software for scientists); steep learning curve
   - SYSTAT
   - Minitab
   - SPSS
   - many others (http://en.wikipedia.org/wiki/Comparison_of_statistical_packages)
   - Excel is not recommended for inferential statistical analysis.
Protocol for hypothesis testing - fill in each blank; write “NA” for questions that are not applicable.
a. Means: If you think the appropriate test is a parametric test of differences in means
b. Variances: If you think the appropriate test is a parametric test of differences in variances,
c. Variables: If you think the appropriate test is a parametric test of relationships between variables,
-are the residuals or each variable normally distributed? [.2] Y/N_______; probs__________
E. Execute test(s) and identify and state value of each test statistic [2]. _________________
    (an incorrect answer limits further points)
G. State reject or cannot reject for each null hypothesis [1]. _________________________
                                Hypothesis Testing 1
Example problems
1. Two purple-flowered pea plants, both heterozygous for flower color, were crossed, resulting in 78
   purple-flowered offspring and 22 white-flowered offspring. Question: Does this outcome differ from
   the expected 3:1 ratio of purple-flowered to white-flowered offspring? (Protocol link)
2. The data below are number of juvenile manatees killed by boats in Florida. Question: Are males and
   females equally susceptible to being killed by boats? (Protocol link)
      no. males killed (1985-1995): 206
      no. females killed (1985-1995): 127
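
Both problems compare observed frequencies with an expected ratio, which is the setting for a chi-square goodness-of-fit test. The sketch below shows the arithmetic only; it assumes no continuity correction, so SYSTAT output may differ, and 3.841 is the standard tabled critical value for df = 1 at alpha = 0.05.

```python
def chi_square_gof(observed, expected_ratio):
    """Return (chi-square, df) for observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

CRITICAL_DF1 = 3.841  # tabled chi-square value, df = 1, alpha = 0.05

peas = chi_square_gof([78, 22], [3, 1])        # purple : white, expected 3:1
manatees = chi_square_gof([206, 127], [1, 1])  # males : females, expected 1:1

for name, (chi2, df) in {"peas": peas, "manatees": manatees}.items():
    verdict = "reject H0" if chi2 > CRITICAL_DF1 else "cannot reject H0"
    print(f"{name}: chi-square = {chi2:.3f}, df = {df}: {verdict}")
```

The remaining protocol steps (stating hypotheses, checking assumptions, biological interpretation) still have to be worked through by hand.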
        Frequencies
        HAB$ (rows) by SEX$ (columns)
                   F      M      Total
               +-------------+
             P | 480    420  |   900
             R |   2     25  |    27
               +-------------+
         Total   482    445      927
                                 Did not contract    Contracted
                                 malaria             malaria
               Heterozygotes     1                   14
               Homozygotes       13                  2
         SYSTAT output:
         Frequencies
         MALARIA$ (rows) by GENES$ (columns)
   Example problems
   1. The following data are frequency of rabies in skunks collected from three geographic areas.
      Question: Is the incidence of rabies dependent on geographic area? (Protocol link)
                       With            Without
       Area            Rabies          Rabies
       Ozarks          14              29
       Ouachitas       12              38
       Delta           11              35
   2. The following data are frequency of individuals with different hair colors according to sex.
      Question: Is human hair color dependent on sex? (Protocol link)
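
For tests of independence, the expected frequency of each cell is (row total x column total) / grand total. A minimal sketch using the skunk rabies table from problem 1; the function name is illustrative, and 5.991 is the standard tabled chi-square critical value for df = 2 at alpha = 0.05.

```python
def chi_square_independence(table):
    """Return (chi-square, df) for a rows x columns table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: Ozarks, Ouachitas, Delta; columns: with rabies, without rabies
rabies = [[14, 29], [12, 38], [11, 35]]
CRITICAL_DF2 = 5.991  # tabled chi-square value, df = 2, alpha = 0.05

chi2, df = chi_square_independence(rabies)
print(f"chi-square = {chi2:.3f}, df = {df}, critical value = {CRITICAL_DF2}")
```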
_________________________
  Bartlett's Test
  Variable      Chi-Square   df      p-Value
  ABSORB_TOT    1.004        1.000   0.316
  Levene's Test - *For Levene’s, use the F-ratio based on the median.
  Variable                         F-Ratio   df      p-Value
  ABSORB_TOT    Based on Mean      1.173     1, 10   0.304
                Based on Median    1.045     1, 10   0.331
Example problems
1. The following data are systolic blood pressure in two breeds of domestic cats. Question: Does
   variation in pressure (mm Hg) differ between Siamese and Mynx cats? (Protocol link)
          Siamese: 122, 138, 129, 152, 149, 166, 110, 114, 155, 136, 189, 145, 129, 115, 144, 134
          Mynx: 129, 128, 109, 115, 108, 116, 125, 124, 117, 132, 111, 113, 127
2. Three different methods were used to determine the dissolved oxygen content of lake water. Each of
   the three methods was applied to a sample of water six times, with the following results. Question:
   Do the three methods yield equally variable results? (Protocol link)
          method 1                    method 2                     method 3
          10.96                       10.88                        10.73
          10.77                       10.71                        10.79
          10.90                       10.88                        10.78
          10.69                       10.86                        10.82
          10.87                       10.70                        10.88
          10.60                       10.89                        10.81
3. The following data are growth rate (g/d) in newborn rats fed four different diets. Question: Is growth
   rate equally variable among diets? (Protocol link)
                  diet A: 1.6, 1.9, 0.9, 1.1, 1.5, 1.0, 1.8, 1.6          diet C: 0.8, 0.9, 0.5, 0.6, 0.7, 0.5, 0.9, 0.8
                  diet B: 2.5, 2.0, 2.8, 2.6, 2.6, 2.9, 1.9, 2.1          diet D: 1.0, 1.1, 0.7, 0.8, 0.9, 0.7, 1.1, 1.0
4. The following data are number of moths caught during the night by four different trap types.
   Question: Is there a difference in the variance of trap effectiveness? (Protocol link)
       Trap type 1:   41, 34, 33, 36, 40, 25, 31, 37, 34, 30, 38
       Trap type 2:   52, 55, 62, 56, 64, 56, 56, 55
       Trap type 3:   25, 33, 34, 37, 41, 34, 40, 36
       Trap type 4:   36, 41, 33, 28, 34, 40, 27, 37
 REVIEW
 Graphic methods for ‘informed guessing’ whether means are statistically
 different (not a substitute for a formal statistical test)
                                        VS.
                                        Fig. 1. (A) Feces production (x 1000) in juvenile and adult
                                        green snakes by month. (B) Feces production (x1000) in adult
                                        male and female green snakes by month. Plotted are
                                        means ± 1 SE.
Question: Do IAA levels differ between the wild type and triple mutants in the 4D germination
treatments?
     a.    H0: µIAA(WS) = µIAA(ILR/IAR/ILL)
     b. pooled variance t (“regular” t-test - assumes homogeneous variances); use this one
     c. separate variance t (“approximate” t-test - does not assume homogeneous variances)
     __________________________________________________
     Separate Variance
     Variable PLANT$ Mean Difference 95.00% Confidence Interval t      df    p-Value
                                     Lower Limit Upper Limit
     IAA      ILR/IAR/ILL -9.867     -17.493      -2.240        -3.592 4.000 0.023
              WS
     Pooled Variance
     Variable PLANT$        Mean Difference 95.00% Confidence Interval t      df    p-Value
                                            Lower Limit Upper Limit
     IAA        ILR/IAR/ILL -9.867          -17.493      -2.241        -3.592 4.000 0.023
                WS
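     The difference between the pooled and separate variance t-tests lies in how the standard error and degrees of freedom are computed. The sketch below uses the textbook formulas (pooled variance with n1 + n2 - 2 df; separate variances with Welch-Satterthwaite df); the function names are illustrative, and the data are the mucus-cell counts from Example problem 1 in this section.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """'Regular' t-test statistic: assumes homogeneous variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, df

def separate_t(a, b):
    """'Approximate' t-test statistic: does not assume homogeneous variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

untreated = [16, 17, 12, 18, 11, 18, 12, 15, 16, 14, 18, 12]
exposed = [8, 10, 12, 13, 14, 6, 5, 7, 10, 11, 9, 8]
print("pooled:   t = %.3f, df = %d" % pooled_t(untreated, exposed))
print("separate: t = %.3f, df = %.3f" % separate_t(untreated, exposed))
```

     With equal sample sizes the two t values coincide; only the degrees of freedom (and therefore the p-value) can differ.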
Example problems
1. The effect of copper sulfate on the mucus cells in the gill filaments of a species of fish was
   investigated. The number of mucus cells per square micron in the gill filaments of untreated fish and
   in fish exposed for 24 hours to copper sulfate (mg/l) was as follows. Question: Does exposure to
   copper sulfate affect the number of mucus cells in these fish? (Protocol link)
               untreated: 16, 17, 12, 18, 11, 18, 12, 15, 16, 14, 18, 12
               exposed:    8, 10, 12, 13, 14, 6, 5, 7, 10, 11, 9, 8
2. A species of bacterium was grown with either glucose or sucrose as a carbon source. After a period of
   incubation, the number of cells (x 10⁶) was determined. Question: Is there a difference in growth
   rate of the bacterium between the two carbon sources? (Protocol link)
                     glucose: 6.3, 5.7, 6.8, 6.1, 5.2
                     sucrose: 5.8, 6.2, 6.0, 5.1, 5.8
      Variables        Levels
      SEX (2 levels) 1.000 2.000
Example problems
1. Twenty people were randomly assigned to two groups of ten each. One group viewed a hairy spider,
   and the other group viewed a similar but nonhairy spider. Each person was asked to score the spider
   she or he viewed on a ranked scariness scale from 1 to 10 (10 being the most scary). The results are
   below. Question: Do people find hairy spiders scarier than nonhairy spiders? (Protocol link).
               hairy:    10, 8, 7, 9, 9, 10, 9, 9, 5, 8
               nonhairy: 7, 6, 8, 6, 1, 5, 4, 5, 6, 3
2. The mass (g) of random samples of adult male tuatara from two localities in New Zealand are given
   below. Question: Do animals from locality A differ in mean mass from locality B? (Protocol link)
       loc A: 510, 773, 840, 505, 765, 780, 235, 790, 440, 435, 815, 460, 690
       loc B: 650, 600, 600, 575, 452, 320, 660
Question: Does early field metabolic rate differ from late field metabolic rate?
Example problems
1. Brucella abortus antibody titers (pfc/10⁶ cells) in 15 turkeys were measured before and after a period
   of stress. Question: Did stress decrease antibody titer in these turkeys? (Protocol link)
       turkey no.:              1 2 3 4 5 6 7 8 9 10               11   12 13   14 15
       before stress:           20 18 19 18 17 14 17 10 13 16      20   17 16   19 8
       after stress:            17 14 16 19 14 18 8 10 12 15       8     6 17   5 3
2. Male hoop snakes, upon encountering one another, may engage in a protracted ritualized combat
   behavior until one establishes himself as dominant over the other. Six males were tested in the
   presence of a female and again in the absence of a female. Whether each male was tested first with or
   without a female was randomly determined. The results in interaction time (min.) are below.
   Question: Do these encounters last longer in the presence of a female? (Protocol link)
       snake no.:       1 2 3 4 5 6
       w/o female:      10 15 8 30 1 80
       w/ female:       59 35 70 65 43 90
7. Question: Does field metabolic rate differ between early and late measurements?
 Example problems
1. The wattle thickness (mm) of 10 randomly selected chickens was measured before and after treatment
   with PHA. Question: Does treatment with PHA affect wattle thickness? (Protocol link)
           Chicken no.      1       2           3          4          5          6          7          8           9           10
          pretreatment     1.05    1.01        0.78       0.98       0.81       0.95       1.00       0.83        0.78        1.05
          posttreatment    3.48    5.02        5.37       5.45       5.37       3.92       6.54       3.42        3.72        3.25
2. Ten young men were asked to rate their feeling of well-being on a scale of 1 (worst) to 10 (best)
   before and after taking an experimental drug. Question: Does the drug increase a person’s sense of
   well-being? (Protocol link)
         individual no.:   1      2        3          4          5          6          7          8          9           10
         before drug:      5      8        2          7          5          2          9          3          9           6
         after drug:       7      9        1          9          5          9          9          9          10          7
     You are responsible for knowing how to work all the Practice Problems concerning differences
     in frequencies, association of frequencies, and differences in variances and two means
     (Goodness-of-Fit, Test of Independence, Fisher’s Exact Test, Bartlett’s, Levene’s, Independent
     Samples t-test, Paired Samples t-test, Mann-Whitney, Wilcoxon). Exam problems will be taken
     directly or modified from Example and Practice Problems.
                                          Hypothesis Testing 2
One-way ANOVA
1.    Test whether sample means are from the same population
2.    Powerful and robust
3.    Null hypothesis: H0: µvariable(group1) = µvariable(group2) = µvariable(group3), etc.
4.    Why not use multiple t-tests? – “The problem of multiple comparisons”
     Approximate chance of a Type I error: 2 means = 5%; 3 means = 15%; 4 means = 30%; 5 means = 50%
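     The percentages above follow from counting the pairwise t-tests needed to compare every pair of means. A minimal sketch: the additive figure (number of tests x 0.05) matches the numbers quoted here, while the second figure, 1 - (1 - 0.05)^m, is the standard family-wise rate under the added assumption that the tests are independent. Function names are illustrative.

```python
from math import comb

ALPHA = 0.05

def pairwise_tests(n_means):
    """Number of two-sample tests needed to compare every pair of means."""
    return comb(n_means, 2)

def additive_error(n_means, alpha=ALPHA):
    """Rough upper bound: alpha added up once per test (capped at 1)."""
    return min(1.0, pairwise_tests(n_means) * alpha)

def familywise_error(n_means, alpha=ALPHA):
    """Chance of at least one Type I error, assuming independent tests."""
    return 1 - (1 - alpha) ** pairwise_tests(n_means)

for k in (2, 3, 4, 5):
    print(f"{k} means: {pairwise_tests(k):2d} tests, "
          f"additive {additive_error(k):.0%}, independent {familywise_error(k):.1%}")
```

     Either way, the chance of a false positive grows quickly with the number of means, which is the reason for using a single ANOVA.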
     REVIEW: Required components of a null hypothesis for questions of differences in means or variances.
          1. Indicator (H0)
          2. Parameter (e.g., µ, σ²)
          3. Variable (e.g., length, mass)
          4. Group (e.g., sex, color); for questions of differences between independent data only (no grouping
              variable for dependent data). Groups are designated by being enclosed in parentheses.
          5. Relational operator (e.g., =, ≥, ≤)
     Examples:
       -independent: H0: µlength(males) = µlength(females)
       -dependent: H0: µbeforelength = µafterlength
 Dep Var: EGGWGT N: 245 Multiple R: 0.7090 Squared multiple R: 0.5027
                    Analysis of Variance
 Source             Sum-of-Squares DF          Mean-Square       F-Ratio P
 CLUTNO             334.2372          23       14.5321           9.7115 0.000
 Error              330.6987          221      1.4964
 Normality and homogeneity assumptions are tested after ANOVA with the residuals (=difference
 between observed value and value predicted by the model)
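 The entries in an ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its df, the F-ratio is the group mean square divided by the error mean square, and the squared multiple R is the share of total variation explained by the grouping variable. The sketch below recomputes these from the sums of squares and df printed in the table above; the variable names are illustrative.

```python
# Values copied from the SYSTAT table above.
ss_group, df_group = 334.2372, 23    # CLUTNO
ss_error, df_error = 330.6987, 221   # Error

ms_group = ss_group / df_group                 # Mean-Square for CLUTNO
ms_error = ss_error / df_error                 # Mean-Square for Error
f_ratio = ms_group / ms_error                  # F-Ratio
r_squared = ss_group / (ss_group + ss_error)   # Squared multiple R

print(f"MS(CLUTNO) = {ms_group:.4f}")
print(f"MS(Error)  = {ms_error:.4f}")
print(f"F-Ratio    = {f_ratio:.4f}")
print(f"R-squared  = {r_squared:.4f}")
```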
Example Problems
1. Random samples of a certain species of zooplankton were collected from five lakes and their selenium content (ppm) was
     determined. Was there a difference among lakes with respect to selenium content? (Protocol link)
2.    The following data are amount of food (kg) consumed per day by adult deer at different times of the year. Test the null
     hypothesis that food consumption was the same for all the months tested. (Protocol link)
After significant ANOVA: Which means are different from which other means?
Post hoc pairwise tests counteract the problem of multiple comparisons by maintaining an overall
alpha level of 0.05; many different post hoc tests are available
Example Problems
1.   In a study of snake hibernation, fifteen pythons of similar size and age were randomly assigned to three groups. One
     group was treated with drug A, one group with drug B, and the third group was not treated. Their systolic blood pressure
     (mmHg) was measured 24 hours after administration of the treatments. Do the drugs affect blood pressure? If so, do they
     have similar effects? (Protocol link)
              control: 130, 135, 132, 128, 130
              drug A: 118, 120, 125, 119, 121
              drug B: 105, 110, 98, 106, 105
2.   Fourteen hucksters were assigned at random to one of three experimental groups and fed a different diet for six months.
     Use the following data on huckster mass (kg) at the end of the experiment to determine if diet affected body size. Which
     diet produced the heaviest hucksters? (Protocol link)
Kruskal-Wallis Test
1. Test whether three or more sample means are from the same population
2. Non-parametric counterpart to one-way ANOVA
3. Null hypothesis: H0: var(group1) = var(group2) = var(group3), etc.
4. Test statistic (H) and probability
   source: Systat/Systat
5. SYSTAT path: Analyze > Nonparametric tests > Kruskal-Wallis (enter dependent
   and grouping (=factor) variables)
 SYSTAT output: (TREAT.SYZ; Plummer, select clutno<8); treat.ppt
 Categorical values encountered during processing are:
 CLUTNO (7 levels)
     1,    2,     3,    4,    5,    6,    7
 Kruskal-Wallis One-Way Analysis of Variance for 89 cases
 Dependent variable is EGGWGT
 Grouping variable is CLUTNO
 Group     Count   Rank Sum
 1         8      374.0000
 2        12      731.5000
 3        14      245.0000
 4        14      833.5000
 5        15      490.0000
 6         9      720.0000
 7        17      611.0000
 Kruskal-Wallis Test Statistic [H] = 46.9358
 Probability is 0.0000 assuming Chi-square distribution with 6 DF
 For post hoc pairwise comparisons after significant KW:
 Dwass-Steel-Critchlow-Fligner Test (DSCF)
 Test for All Pairwise Comparisons
         Group(i) Group(j) Statistic p-Value
         1        2        7.8558 0.0000
         1        3        1.2552 0.9745
         1        4        9.8964 0.0000
         1        5        5.8438 0.0007
         1        6        6.1237 0.0003
         1        7        6.5521 0.0001
         2        3        -4.1468 0.0524
         2        4        0.9094 0.9954
         2        5        0.8282 0.9972
         etc.
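 The H statistic can be recomputed from the group counts and rank sums in the output above. The sketch below uses the standard uncorrected formula H = 12/(N(N+1)) x sum(R²/n) - 3(N+1); the function name is illustrative. It gives a value within about 0.01 of the printed H; the small remaining difference is presumably a correction for tied ranks, which cannot be reproduced without the raw data.

```python
# Counts and rank sums copied from the SYSTAT output above (CLUTNO groups 1-7).
counts = [8, 12, 14, 14, 15, 9, 17]
rank_sums = [374.0, 731.5, 245.0, 833.5, 490.0, 720.0, 611.0]

def kruskal_wallis_h(counts, rank_sums):
    """Kruskal-Wallis H from group sizes and rank sums (no tie correction)."""
    n = sum(counts)
    weighted = sum(r ** 2 / c for r, c in zip(rank_sums, counts))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

h = kruskal_wallis_h(counts, rank_sums)
df = len(counts) - 1   # H is compared with a chi-square distribution on k - 1 df
print(f"N = {sum(counts)}, H = {h:.4f}, df = {df}")
```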
Example Problems
1.   Twenty-four freshwater clams were randomly assigned to four groups of six each. One group was placed in deionized
     water, one group was placed in a solution of 0.5 mM sodium sulfate, and one group was placed in a solution of 0.74 mM
     sodium chloride. At the end of a specified time period, blood potassium levels (M K+) were determined. Did treatment
     affect blood potassium levels? (Protocol link)
2.   An entomologist interested in the vertical distribution of a fly species collected the following data on numbers of flies (no.
     flies/m³) from each of three different vegetation layers. Use these data to test the hypothesis that fly abundance was the
     same in all three vegetation layers. (Protocol link)
               herbs        shrubs             trees
               14.0         8.4                6.9
               12.1         5.1                7.3
               5.6          5.5                5.8
               6.2          6.6                4.1
               12.2         6.3                5.4
   Example 1:
  SYSTAT output; MOUSEDIET.SYZ; Cooper
  Variables                    Levels
  DIET$ (4 levels)   5K-96 AIN-cas AIN-spi P5001
  NPDOSE$ (2 levels) 0     2000
  Analysis of Variance
  Source           Type III SS df Mean Squares F-ratio p-value
  DIET$            66249.445 3 22083.148       47.793 0.000             *Note there are 3 separate
  NPDOSE$          29869.989 1 29869.989       64.645 0.000              hypotheses tested
  DIET$*NPDOSE$ 1538.354 3 512.785             1.110 0.347
  Error            62378.033 135 462.060
    Conclusions
     Diet explains a significant amount of variation in body weight. Body weight
       is greater in mice with the P 5001 diet.
       NPdose.
     There is no interaction between diet and NPdose.
    [Interaction Plot: BODWGT by NPDOSE$]
 Variables          Levels
 DIET$ (2 levels)   c   j
 STRESS$ (2 levels) h   l
 Dependent Variable WGTGAIN
 N                  32
 Multiple R         0.844
 Squared Multiple R 0.712
 Analysis of Variance
 Source           Type III SS df Mean Squares F-ratio p-value
 DIET$            1568.000    1  1568.000     32.449  0.000
 STRESS$          1458.000    1  1458.000     30.173  0.000
 DIET$*STRESS$    312.500     1  312.500      6.467   0.017
 Error            1353.000    28 48.321
 [Interaction Plot: WGTGAIN by STRESS$ (h, l), with separate lines for DIET$ (c, j)]
Example Problems
1. Use USOPHEO.SYZ; Plummer to determine if body size is affected by sex and/or location. Read the description of the data
    file before proceeding. (Protocol link)
2. Qualime epithelial cancer is hypothesized to result from either genotype or several environmental factors that vary by
   season. To address this hypothesis, use the data below on QSA level (g/g; the diagnostic test indicator of qualime
   cancer) that were collected on 20 individuals in different seasons. (Protocol link)
 QSA Genotype Season QSA Genotype Season                       QSA Genotype Season         QSA Genotype Season
 478   ZZ           Winter 425        ZW           Summer 428         ZZ           Summer 466       ZW           Winter
 538   ZZ           Winter 467        ZW           Summer 478         ZZ           Summer 522       ZW           Winter
 502   ZZ           Winter 444        ZW           Summer 455         ZZ           Summer 489       ZW           Winter
 496   ZZ           Winter 438        ZW           Summer 446         ZZ           Summer 475       ZW           Winter
 483   ZZ           Winter 431        ZW           Summer 432         ZZ           Summer 501       ZW           Winter
3. Work practice problem #56. Why is it a one-way rather than a two-way ANOVA? You will have to create a derived variable
    to work the problem. There are two ways to do this: (1) enter the derived variable directly on the SYSTAT data sheet or
    (2) enter all of the data shown and use TRANSFORM If.., Then Let to create the derived variable. You likely will need
    to review how to create derived variables.
Correlation
      correlation analysis is a test of association that makes no assumption about a cause-and-effect
       relationship (i.e., there is no dependent and independent variable)
      addresses two questions
       - does an association exist between two variables?
       - if the association exists, what is its strength (effect)?
      requires that both variables be normally distributed random variables
         Means
         BUFO SPECIES
         1.5750 2.5000
         Pearson Correlation
         Matrix
                  BUFO SPECIES
         BUFO     1.0000
         SPECIES 0.6198 1.0000
         Matrix of Bonferroni
         Probabilities
                      BUFO SPECIES
         BUFO         0.0000
         SPECIES 0.0000 0.0000
Bonferroni probability correction (counteracts “the problem of multiple comparisons”); reduces
chances of making a Type I error (= “false positive” in the medical literature)
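The idea behind the correction can be shown in a few lines. The sketch below assumes the simplest form of the Bonferroni adjustment, in which each p-value is multiplied by the number of tests and capped at 1; whether SYSTAT computes its matrix in exactly this way is an assumption, and the p-values used are hypothetical.

```python
from math import comb

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Five variables give comb(5, 2) = 10 pairwise correlations.
n_pairs = comb(5, 2)
print("pairwise correlations among 5 variables:", n_pairs)

# Hypothetical unadjusted p-values, for illustration only.
raw = [0.001, 0.020, 0.300]
for p, p_adj in zip(raw, bonferroni_adjust(raw)):
    print(f"raw p = {p:.3f} -> Bonferroni p = {p_adj:.3f}")
```

A correlation that looks significant on its own (raw p = 0.020) may no longer be significant once the number of comparisons is taken into account.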
Means
BUFO RASP HYLA INDIVIDUALS SPECIES
1.5750 1.7750 0.7500 5.1750 2.5000
Example problems
1.   Use the following data on wing length (cm) and tail length (cm) in cowbirds to determine if there is a relationship between
     the two variables. (Protocol link)
          Wing     10.4   10.8   11.1   10.2    10.3   10.2     10.7   10.45     10.8   11.2   10.6
          Tail     7.4    7.6    7.9    7.2     7.4    7.1      7.4    7.2       7.8    7.7    7.8
2.   Use the following data taken from crabs to determine if there is a relationship between weight of gills (g) and weight of
     body (g) and between weight of thoracic shield (g) and weight of body. (Protocol link)
            Body      159     179    100    45     384        230      100     320      80     220      320
            Gill      14.4    15.2   11.3   2.5    22.7       14.9     11.4    15.81    4.19   15.39    17.25
            Thorax    80.5    85.2   49.9   21.1   195.3      111.5    56.6    156.1    39.0   108.91   160.1
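The Pearson correlation coefficient behind a matrix like the one shown earlier can be computed directly from its definition. A minimal sketch, using the cowbird wing and tail lengths from Example problem 1 above; the function name is illustrative, and testing r for significance still requires a probability from software or a table.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled by their own variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.45, 10.8, 11.2, 10.6]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8]
print(f"r = {pearson_r(wing, tail):.4f}")
```

Swapping the two variables gives the same r, which reflects the point that correlation has no dependent or independent variable.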
Example problems
   1. The following data are ranked scores for ten students who took both a math and a biology aptitude examination. Is
       there a relationship between math and biology aptitude scores for these students? (Protocol link)
         Math        53 45 72 78 53 63 86 98 59 71
         Biology 83 37 41 84 56 85 77 87 70 59
  2.    Test the following data to determine if there is a relationship between the total length of aphid stem mothers and the
        mean thorax length of their parthenogenetic offspring. (Protocol link)
          Mother       8.7   8.5     9.4    10.0 6.3        7.8     11.9 6.5     6.6     10.6
          offspring 5.95 5.65 6.00 5.70 4.40 5.53 6.00 4.18 6.15 5.93
_________________________________________________________
       Procedure
       a. Fit regression line (least squares method; minimize Σ(residuals²))
       b. Test for significance of slope
       c. Write the regression equation (general form Y = a (intercept) + b (slope) X)
           -do NOT use math format (y = mx + b)
       d. Add regression statistics and variable names
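       Steps a, c, and d can be sketched with the ordinary least squares formulas. The function name is illustrative, the data are the fruit fly larval density and adult weight values from Example problem 2 later in this section, and the variable names WEIGHT and DENSITY are chosen here only for the printed equation. The significance test of the slope (step b) is not included.

```python
def least_squares(x, y):
    """Return (intercept a, slope b, r squared) for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed Y minus Y predicted by the fitted line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_residual / ss_total

density = [1, 3, 5, 6, 10, 20, 40]
weight = [1.356, 1.356, 1.284, 1.252, 0.989, 0.664, 0.475]
a, b, r2 = least_squares(density, weight)
# Write the equation with variable names in place of Y and X.
print(f"WEIGHT = {a:.4f} + {b:.4f} DENSITY   (r squared = {r2:.3f})")
```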
  Output format
   Regression statistics: intercept (=constant), slope (=regression coefficient); standard error
   ANOVA table (test statistic, probability)
   KS test of assumptions
  Analysis of Variance
  Source     SS          df Mean Squares F-Ratio p-Value
  Regression 1.3063E+008 1 1.3063E+008 100.1291 0.0000            Test statistic and
  Residual 8.6103E+007 66 1304593.7089                              probability
            The regression equation from the above analysis and represented on the
            graph is:
                          EGGNO = -4914.6 + 561.6 FEMLEN
            In the regression equation, note that 'X' and 'Y' are replaced with the specific
            variables in question, i.e., FEMLEN and EGGNO. Also note that the
            dependent variable, EGGNO, is plotted on the Y axis, and the independent
            variable, FEMLEN, is plotted on the X axis. Another way of stating this is,
            "EGGNO is plotted against FEMLEN", or "EGGNO is regressed on
            FEMLEN."
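            Once the equation is written, prediction is a matter of substituting a value of the independent variable. A minimal sketch using the equation above; the function name and the two FEMLEN values are chosen only for illustration.

```python
INTERCEPT = -4914.6   # a (constant)
SLOPE = 561.6         # b (regression coefficient)

def predict_eggno(femlen):
    """Predicted EGGNO for a given FEMLEN: EGGNO = a + b * FEMLEN."""
    return INTERCEPT + SLOPE * femlen

for femlen in (10, 15):
    print(f"FEMLEN = {femlen}: predicted EGGNO = {predict_eggno(femlen):.1f}")
```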
            A regression plot is a SYSTAT Scatterplot with a linear smoother.
            [Regression plot: EGGNO (Y axis) against FEMLEN (X axis)]
Example problems
   1. The following data are rate of oxygen consumption (ml/g/hr) in crows at different temperatures (°C). Does
       temperature affect oxygen consumption in crows? Determine the equation for predicting oxygen consumption from
       temperature. (Protocol link)
    2.   Use the following data on mean adult body weight (mg) and larval density (no./mm³) of fruit flies to determine if there
         is a functional relationship between adult body mass and the density at which it was reared. Determine the equation
         for predicting body weight from larval density. (Protocol link)
            density   1         3       5           6       10        20                40
            weight    1.356     1.356   1.284       1.252   0.989     0.664             0.475
Extrapolation: linear regressions are statistically valid only within limits of the data (independent
variable, X); beyond data - do not know if relationship is linear
A regression of tooth size on actual body length for the living Carcharodon carcharias indicates by
extrapolation (assuming continued linearity) that C. megalodon was “only” 13 m (43 ft) in length!
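One way to guard against accidental extrapolation is to have the prediction step refuse values of X outside the range of the data. A minimal sketch: the function name is illustrative, the coefficients are those of the EGGNO equation given earlier, and the limits of 5 and 20 are an assumption (they are not the actual minimum and maximum FEMLEN in the data).

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict Y = a + bX, but only for X inside the observed range of the data."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"X = {x} is outside the data range ({x_min} to {x_max}); "
            "linearity beyond the data is unknown"
        )
    return intercept + slope * x

# Assumed limits of the independent variable, for illustration only.
X_MIN, X_MAX = 5, 20

print(predict_within_range(10, -4914.6, 561.6, X_MIN, X_MAX))
try:
    predict_within_range(25, -4914.6, 561.6, X_MIN, X_MAX)
except ValueError as err:
    print("refused:", err)
```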
Model building in regression (goal is to build a better model by increasing r²; results in
more accurate prediction)
Data transformation
   1. SYSTAT e.g.: calibrate transmitters; DEMO
   2. Linear vs. log10 data regressions - note increase in r² and linearity with log transformation
   [Plots: PI against TEMP (linear scale) and LOGPI against TEMP (log10 scale)]
                                        Analysis of Variance
  Source                      Sum-of-Squares   df Mean-Square                          F-ratio          P
  Regression                    3518606.514     1 3518606.514                          221.638          0.000
  Residual                        79377.200     5    15875.440
                                        Analysis of Variance
  Source                      Sum-of-Squares   df Mean-Square                         F-ratio       P
  Regression                          0.184     1        0.184                       6185.068       0.000
  Residual                            0.000     5        0.000
      ____________________________________________
      Example 2: inverse prediction (predict X from Y); Y = 14.5 + 2.56X; by algebraic manipulation
      Y-14.5 = 2.56X; (Y-14.5)/2.56 = X
X = (175.78-14.5)/2.56 = 63
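      The same rearrangement can be written as a small function. A minimal sketch using the equation and the value of Y from the example above; the function name is illustrative.

```python
def inverse_predict(y, intercept, slope):
    """Solve Y = a + bX for X: X = (Y - a) / b."""
    return (y - intercept) / slope

x = inverse_predict(175.78, intercept=14.5, slope=2.56)
print(f"X = {x:.2f}")
```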
B. Semilog (logY, X) equations: log Y = log a + bX (must take the inverse log of
      log Y to get final answer on linear scale)
    Example: using the regression equation log Y = 1.42234 +0.047560X, predict Y when X = 12.1
    Note that the intercept (1.42234) is a log value (i.e., log a = 1.42234). You must not take the log of
    this value when calculating log Y; that would be the equivalent of taking the log of a log!
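    The two steps (calculate log Y, then take the inverse log) can be sketched as follows, assuming base-10 logarithms as used elsewhere in these notes. The intercept is used exactly as given because it is already a log value; the variable names are illustrative.

```python
LOG_A = 1.42234   # intercept, already on the log10 scale
B = 0.047560      # slope

x = 12.1
log_y = LOG_A + B * x   # step 1: predicted log Y (do not take the log of LOG_A)
y = 10 ** log_y         # step 2: inverse log returns Y to the linear scale

print(f"log Y = {log_y:.5f}")
print(f"Y     = {y:.2f}")
```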
C. Log-log (logY, logX) and exponential equations: log Y = log a + b(log X); Y = aX^b
    Example 1 (logarithmic form): using the regression equation log Y = 2.53403 + 0.72000(log X),
    predict Y when X = 1.98
log Y = 2.53403 + 0.72000 (log 1.98) = 2.74763 (calculate regression coefficients and
    The most common form of the log-log regression equation, and one that is much easier to use is the
    exponential form:
log Y = log a + b(log X) = log a + log X^b; take inverse logs: Y = aX^b (exponential form)
 Example 2 (exponential form): using the regression equation Y = 342X^0.720, predict Y when X = 1.98
Y = 342(1.98^0.720) = 559.28
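    The logarithmic and exponential forms give the same prediction, which can be checked numerically. A minimal sketch using the coefficients from Examples 1 and 2 and assuming base-10 logarithms; the two answers differ only by rounding because 342 is a rounded inverse log of 2.53403. Variable names are illustrative.

```python
from math import log10

x = 1.98

# Logarithmic form: log Y = log a + b(log X), then take the inverse log.
log_y = 2.53403 + 0.72000 * log10(x)
y_from_log = 10 ** log_y

# Exponential form: Y = a * X^b, with a = 342 (the inverse log of 2.53403, rounded).
y_from_exp = 342 * x ** 0.720

print(f"log Y = {log_y:.5f}")
print(f"Y (logarithmic form) = {y_from_log:.2f}")
print(f"Y (exponential form) = {y_from_exp:.2f}")
```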
   Example 1: A common belief is that men are stronger than women. Is this belief due to men being
   bigger or are men actually stronger when compared to women of similar body size? Test this question
   on data from a sample of healthy young adults (stronger.syz). The variables are sex, lean body mass,
   and a measure of strength called “slow, right extensor knee peak torque.”
2. Circular statistics (Rayleigh Test) – techniques for data measured on an angular scale. Angular scales
   are circular in nature, have no designated zero, and the designation of high and low values is arbitrary.
   For example, 0° and 360° point to the same direction.
3. Principal component analysis (PCA) - variable reduction technique that describes variability among
   multiple observed variables in terms of a lower number of non-measured derived variables
Statistical Tables