Solutions Modernstatistics
Springer Nature
Chapter 1
Analyzing Variability: Descriptive Statistics
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mistat
from scipy import stats
Solution 1.1 random.choices selects 𝑘 values from the list using sampling with
replacement.
import random
random.seed(1)
values = random.choices([1, 2, 3, 4, 5, 6], k=50)
   The expected frequency in each cell, under randomness, is 50/6 ≈ 8.3. You will
get different numerical results due to randomness.
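One way to tabulate the observed frequencies and compare them with the expected 8.3 per face (this check is our addition, not part of the original solution):

from collections import Counter

counts = Counter(values)
for face in sorted(counts):
    print(face, counts[face])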
Solution 1.2 The Python function range returns a lazy range object, not a list. As we
need a list of values, we explicitly convert it.
x = list(range(50))
y = [5 + 2.5 * xi for xi in x]
y = [yi + random.uniform(-10, 10) for yi in y]
pd.DataFrame({'x': x, 'y': y}).plot.scatter(x='x', y='y')
plt.show()
(Figure: scatter plot of y against x.)
Solution 1.3 The observed sums for the four probabilities are:

p      sum
0.1      4
0.3     12
0.7     33
0.9     43

Notice that the expected values of the sums are 5, 15, 35 and 45.
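A sketch of how such sums could be simulated (our assumption: each sum is of 50 Bernoulli(p) trials, so its expected value is 50p):

import random
random.seed(1)
for p in (0.1, 0.3, 0.7, 0.9):
    # sum of 50 Bernoulli(p) indicators; expected value is 50 * p
    total = sum(1 for _ in range(50) if random.random() < p)
    print(p, total)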
Solution 1.4 We can plot the data and calculate mean and standard deviation.
# (plotting code for the two instruments is not reproduced here)
plt.show()
(Figure: measurements of the two instruments plotted against observation index.)
   As shown in the figure above, the measurements on Instrument 1 (circles) seem
to be accurate but less precise than those on Instrument 2 (plus signs). Instrument 2
seems to have an upward bias (inaccurate). Quantitatively, the mean of the measurements
on Instrument 1 is X̄_1 = 10.034 and its standard deviation is S_1 = 0.871. For
Instrument 2 we have X̄_2 = 10.983 and S_2 = 0.569.
Solution 1.5 If the scale is inaccurate, it will show on average a deterministic
component different from the nominal weight. If the scale is imprecise, different
weight measurements will show a high degree of variability around the correct
nominal weight. Problems with stability arise when the accuracy of the scale changes
with time; the scale should then be recalibrated.
Solution 1.6 The method random.choices creates a random selection with re-
placement. Note that in range(start, end) the end argument is excluded. We
therefore need to set it to 101.
import random
random.choices(range(1, 101), k=20)
 [6, 88, 57, 20, 51, 49, 36, 35, 54, 63, 62, 46, 3, 23, 18, 59, 87, 80,
 80, 82]
Solution 1.7 The method random.sample selects values without replacement.
import random
random.sample(range(11, 31), 10)
[19, 12, 13, 28, 11, 18, 26, 23, 15, 14]
Solution 1.8 (i) 26^5 = 11,881,376; (ii) 7,893,600; (iii) 26^3 = 17,576; (iv) 2^10 = 1,024;
(v) C(10,5) = 252.
Solution 1.10
filmsp = mistat.load_data('FILMSP')
filmsp.plot.hist()
plt.show()
(Figure: histogram of film speed, with frequency on the vertical axis.)
Solution 1.11
coal = mistat.load_data('COAL')
pd.DataFrame(coal.value_counts(sort=False))
                 count
  COAL
  3                17
  6                 3
  4                 6
  0                33
  5                 5
  2                18
  7                 1
  1                28
Solution 1.12 For (i) and (ii), we can use the pandas value_counts method, e.g.:
car = mistat.load_data('CAR')
car['cyl'].value_counts(sort=False)
 cyl
 4    66
 6    30
 8    13
 Name: count, dtype: int64
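The binned frequency distributions shown below for turn diameter (and, analogously, for horsepower and mpg) were presumably obtained by binning with pd.cut; a sketch, with bin edges assumed from the printed output:

bins = list(range(28, 46, 2))   # bin edges assumed from the output below
pd.cut(car['turn'], bins=bins).value_counts(sort=False)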
 turn
 (28, 30]     3
 (30, 32]    16
 (32, 34]    16
 (34, 36]    26
 (36, 38]    20
 (38, 40]    18
 (40, 42]     8
 (42, 44]     2
 Name: count, dtype: int64
   Note that the bin intervals are open on the left and closed on the right.
   (iv) Frequency distribution of Horsepower:
 hp
 (50, 75]       7
 (75, 100]     34
 (100, 125]    22
 (125, 150]    18
 (150, 175]    17
 (175, 200]     6
 (200, 225]     4
 (225, 250]     1
 Name: count, dtype: int64
mpg
(9, 14]      1
(14, 19]    42
(19, 24]    41
(24, 29]    22
(29, 34]     3
Name: count, dtype: int64
Solution 1.13
filmsp = mistat.load_data('FILMSP')
filmsp = filmsp.sort_values(ignore_index=True)      # sort and reset index
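The quantile calls themselves are not shown above; presumably the values below were obtained with the pandas quantile method:

print(filmsp.quantile([0, 0.25, 0.5, 0.75, 1.0]))
print(filmsp.quantile([0.8, 0.9, 0.99]))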
0.00     66.0
0.25    102.0
0.50    105.0
0.75    109.0
1.00    118.0
Name: FILMSP, dtype: float64
0.80    109.8
0.90    111.0
0.99    114.0
Name: FILMSP, dtype: float64
   Here is a solution that uses pure Python. Note that pandas' quantile implements
different interpolation methods, which will lead to differences for smaller datasets.
We therefore recommend using the library method and selecting the interpolation that
is most suitable for your use case.
0 66.0
0.25 102.0
0.5 105.0
0.75 109.0
0.8 109.5
0.9 111.0
0.99 114.0
1.0 118.0
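The pure-Python code is not reproduced above; a minimal sketch using linear interpolation between order statistics (one of several possible conventions, so the values may differ slightly from those printed above) is:

def quantile(sorted_values, q):
    # linear interpolation between order statistics (one common convention)
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return (1 - frac) * sorted_values[lower] + frac * sorted_values[upper]

values = sorted(filmsp)
for q in (0, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99, 1.0):
    print(q, quantile(values, q))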
Solution 1.14
filmsp = mistat.load_data('FILMSP')
n = len(filmsp)
mean = filmsp.mean()
deviations = [film - mean for film in filmsp]
S = math.sqrt(sum(deviation**2 for deviation in deviations) / n)
# The moment-based (population) formulas below are assumed; they reproduce
# the printed values.
skewness = sum(deviation**3 for deviation in deviations) / (n * S**3)
kurtosis = sum(deviation**4 for deviation in deviations) / (n * S**4)
print('Python:\n',
      f'Skewness: {skewness}, Kurtosis: {kurtosis}')
print('Pandas:\n',
      f'Skewness: {filmsp.skew()}, Kurtosis: {filmsp.kurtosis()}')
 Python:
  Skewness: -1.8098727695275856, Kurtosis: 9.014427238360716
 Pandas:
  Skewness: -1.8224949285588137, Kurtosis: 6.183511188870432
   The distribution of film speed is negatively skewed and much steeper (more peaked) than the
normal distribution. Note that the calculated values differ between the methods: the pandas
functions apply bias corrections, and pandas' kurtosis reports excess kurtosis (kurtosis minus 3).
Solution 1.15 The pandas groupby method groups the data based on the value. We
can then calculate individual statistics for each group.
car = mistat.load_data('CAR')
car['mpg'].groupby(by=car['origin']).mean()
car['mpg'].groupby(by=car['origin']).std()
# calculate both at the same time
print(car['mpg'].groupby(by=car['origin']).agg(['mean', 'std']))
                 mean         std
 origin
 1        20.931034     3.597573
 2        19.500000     2.623855
 3        23.108108     4.280341
Solution 1.16 We first create a subset of the data frame that contains only US made
cars and then calculate the statistics for this subset only.
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
gamma = car_US['turn'].std() / car_US['turn'].mean()
print(f'coefficient of variation: {gamma:.4f}')
Solution 1.17
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
car_Asia = car[car['origin'] == 3]
print('US')
print('mean', car_US['turn'].mean())
print('geometric mean', stats.gmean(car_US['turn']))
print('Japanese')
print('mean', car_Asia['turn'].mean())
print('geometric mean', stats.gmean(car_Asia['turn']))
 US
 mean 37.203448275862065
 geometric mean 37.06877691910792
 Japanese
 mean 33.04594594594595
 geometric mean 32.97599107553825
   We see that 𝑋¯ is greater than 𝐺. The cars from Asia have smaller mean turn
diameter.
Solution 1.18
filmsp = mistat.load_data('FILMSP')
Xbar = filmsp.mean()
S = filmsp.std()
print(f'mean: {Xbar}, stddev: {S}')
expected = {1: 0.68, 2: 0.95, 3: 0.997}
for k in (1, 2, 3):
  left = Xbar - k * S
  right = Xbar + k * S
  proportion = sum(left < film < right for film in filmsp)
  print(f'X +/- {k}S: ',
        f'actual freq. {proportion}, ',
        f'pred. freq. {expected[k] * len(filmsp):.2f}')
  The discrepancies between the actual and predicted frequencies are
due to the fact that the distribution of film speed is neither symmetric nor bell-shaped.
Solution 1.19
car = mistat.load_data('CAR')
car.boxplot(column='mpg', by='origin')
plt.show()
(Figure: boxplots of mpg grouped by origin.)
Solution 1.20
oturb = mistat.load_data('OTURB')
mistat.stemLeafDiagram(oturb, 2, leafUnit=0.01)
                               4 2 3444
                              18 2 55555666677789
                              40 3 0000001111112222333345
                            (15) 3 566677788899999
                              45 4 00022334444
                              34 4 566888999
                              25 5 0112333
                              18 5 6789
                              14 6 01122233444
                               3 6 788
   • X_(1) = 0.23,
   • Q_1 = X_(25.25) = X_(25) + 0.25(X_(26) − X_(25)) = 0.31,
   • Me = X_(50.5) = 0.385,
   • Q_3 = X_(75.75) = 0.49 + 0.75(0.50 − 0.49) = 0.4975,
   • X_(n) = 0.68.
Solution 1.21
from scipy.stats import trim_mean

oturb = mistat.load_data('OTURB')
print(f'T(0.1) = {trim_mean(oturb, 0.1)}')
# trim_std is assumed to be a helper for the trimmed standard deviation
# (it is not part of scipy)
print(f'S(0.1) = {trim_std(oturb, 0.1)}')
 T(0.1) = 0.40558750000000005
 S(0.1) = 0.09982289003530202
Solution 1.22
germanCars = [10, 10.9, 4.8, 6.4, 7.9, 8.9, 8.5, 6.9, 7.1,
              5.5, 6.4, 8.7, 5.1, 6.0, 7.5]
japaneseCars = [9.4, 9.5, 7.1, 8.0, 8.9, 7.7, 10.5, 6.5, 6.7,
                9.3, 5.7, 12.5, 7.2, 9.1, 8.3, 8.2, 8.5, 6.8, 9.5, 9.7]
# convert to pandas Series
germanCars = pd.Series(germanCars)
japaneseCars = pd.Series(japaneseCars)
# use describe to calculate statistics
comparison = pd.DataFrame({
  'German': germanCars.describe(),
  'Japanese': japaneseCars.describe(),
})
print(comparison)
            German      Japanese
 count   15.000000     20.000000
 mean     7.373333      8.455000
 std      1.780235      1.589596
 min      4.800000      5.700000
 25%      6.200000      7.175000
 50%      7.100000      8.400000
 75%      8.600000      9.425000
 max     10.900000     12.500000
Solution 1.23
hadpas = mistat.load_data('HADPAS')
sampleStatistics = pd.DataFrame({
  'res3': hadpas['res3'].describe(),
  'res7': hadpas['res7'].describe(),
})
print(sampleStatistics)
                  res3           res7
count       192.000000     192.000000
mean       1965.239583    1857.776042
std         163.528165     151.535930
min        1587.000000    1420.000000
25%        1860.000000    1772.250000
50%        1967.000000    1880.000000
75%        2088.750000    1960.000000
max        2427.000000    2200.000000
Histogram:
ax = hadpas.hist(column='res3', alpha=0.5)
hadpas.hist(column='res7', alpha=0.5, ax=ax)
plt.show()
(Figure: overlaid histograms of res3 and res7.)
     We overlay both histograms in one plot and make them transparent (alpha).
     Stem and leaf diagrams:
print('res3')
mistat.stemLeafDiagram(hadpas['res3'], 2, leafUnit=10)
print('res7')
mistat.stemLeafDiagram(hadpas['res7'], 2, leafUnit=10)
res3
            1     15   8
            6     16   01124
           14     16   56788889
       22        17    00000234
       32        17    5566667899
       45        18    0011112233444
       60        18    556666677888899
       87        19    000000111111223333344444444
     (18)        19    566666666667888889
       87        20    00000000122222333334444
       64        20    55556666666677789999
       44        21    0000011222233344444
       25        21    566667788888
       13        22    000111234
        4        22    668
        0        23
        1        24    2
 res7
        1        14    2
        9        15    11222244
       15        15    667789
       23        16    00012334
       30        16    5566799
       40        17    0022233334
       54        17    66666666777999
       79        18    0000222222222233344444444
     (28)        18    5555556666666778888888999999
       85        19    000000001111112222222233333444444
       52        19    566666667777888888889999
       28        20    0000111222333444
       12        20    678
        9        21    1123344
        2        21    8
        1        22    0
Solution 1.24
hadpas.boxplot(column='res3')
plt.show()
(Figure: boxplot of res3.)
   Lower whisker starts at max(1587, 1511.7) = 1587 = 𝑋 (1) ; upper whisker ends
at min(2427, 2440.5) = 2427 = 𝑋 (𝑛) . There are no outliers.
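A sketch of how the whisker limits can be computed (note that pandas' default quantile interpolation may give slightly different fence values than those quoted above):

res3 = hadpas['res3']
q1, q3 = res3.quantile(0.25), res3.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print('fences:', lower_fence, upper_fence)
print('whiskers:', max(res3.min(), lower_fence), min(res3.max(), upper_fence))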
Chapter 2
Probability Models and Distribution Functions
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
(Figure: the sets A and B shown as regions in the plane.)
(A ∩ B)^c = (A ∩ B^c) ∪ (A^c ∩ B) ∪ (A^c ∩ B^c)
          = (A ∩ B^c) ∪ A^c
          = A^c ∪ B^c.

B = B ∩ S = B ∩ (∪_{i=1}^n A_i) = ∪_{i=1}^n A_i B.
Solution 2.7

Pr{A ∪ B ∪ C} = Pr{(A ∪ B) ∪ C}
              = Pr{A ∪ B} + Pr{C} − Pr{(A ∪ B) ∩ C}
              = Pr{A} + Pr{B} − Pr{A ∩ B} + Pr{C} − Pr{A ∩ C} − Pr{B ∩ C} + Pr{A ∩ B ∩ C}
              = Pr{A} + Pr{B} + Pr{C} − Pr{A ∩ B} − Pr{A ∩ C} − Pr{B ∩ C} + Pr{A ∩ B ∩ C}.
                                                                                          Ð𝑛
{exc:partition-union}
                        Solution 2.8 We have shown in Exercise 2.6 that 𝐵 = 𝑖=1                 𝐴𝑖 𝐵. Moreover, since
                        { 𝐴1 , . . . , 𝐴𝑛 } is a partition, 𝐴𝑖 𝐵 ∩ 𝐴 𝑗 𝐵 = ( 𝐴𝑖 ∩ 𝐴 𝑗 ) ∩ 𝐵 = ∅ ∩ 𝐵 = ∅ for all 𝑖 ≠ 𝑗.
                        Hence, from Axiom 3
                                                                  ( 𝑛       )     𝑛
                                                                   Ø             ∑︁
                                                     Pr{𝐵} = Pr         𝐴𝑖 𝐵 =       Pr{ 𝐴𝑖 𝐵}.
                                                                    𝑖=1               𝑖=1
Solution 2.9

S = {(i_1, i_2) : i_j = 1, ..., 6, j = 1, 2}
A = {(i_1, i_2) : i_1 + i_2 = 10} = {(4, 6), (5, 5), (6, 4)}
Pr{A} = 3/36 = 1/12.
Solution 2.10

Pr{B} = Pr{A_150} − Pr{A_280} = exp(−150/200) − exp(−280/200) = 0.2258.
Solution 2.11

C(10,2) C(10,2) C(15,2) C(5,2) / C(40,8) = 0.02765.
Solution 2.12 (i) 100^5 = 10^10; (ii) C(100,5) = 75,287,520.
Solution 2.13 N = 1000, M = 900, n = 10.

(i) Pr{X ≥ 8} = Σ_{j=8}^{10} C(10,j) (0.9)^j (0.1)^{10−j} = 0.9298.

(ii) Pr{X ≥ 8} = Σ_{j=8}^{10} C(900,j) C(100,10−j) / C(1000,10) = 0.9308.
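These values can be checked with scipy (our addition):

from scipy import stats

# (i) binomial model: 10 draws with success probability 0.9
print(1 - stats.binom(10, 0.9).cdf(7))
# (ii) exact hypergeometric model: 900 "good" units in a population of 1000,
#      sample of size 10
print(1 - stats.hypergeom(M=1000, n=900, N=10).cdf(7))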
Solution 2.19
Solution 2.21 Let n be the number of people in the party. The probability that all
their birthdays fall on different days is Pr{D_n} = ∏_{j=1}^{n} (365 − j + 1)/365.
   (i) If n = 10, Pr{D_10} = 0.8831.
   (ii) If n = 23, Pr{D_23} = 0.4927. Thus, the probability of at least 2 persons with
the same birthday, when n = 23, is 1 − Pr{D_23} = 0.5073 > 1/2.
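A short check in Python (our addition):

def prob_all_different(n):
    # probability that n birthdays are all distinct
    prob = 1.0
    for j in range(1, n + 1):
        prob *= (365 - j + 1) / 365
    return prob

print(prob_all_different(10), prob_all_different(23))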
Solution 2.22 (7/10)^10 = 0.02825.
                                        
Solution 2.23 ∏_{j=1}^{10} (1 − 1/(24 − j + 1)) = 0.5833.
Solution 2.24 (i) C(4,1) C(86,4) / C(100,5) = 0.1128; (ii) C(4,1) C(10,1) C(86,3) / C(100,5) = 0.0544;
(iii) 1 − C(86,5)/C(100,5) = 0.5374.
Solution 2.25 4/C(10,2) = 4/45 = 0.0889.
Solution 2.26 The sample median is X_(6), where X_(1) < · · · < X_(11) are the ordered
sample values. Pr{X_(6) = k} = C(k−1,5) C(20−k,5) / C(20,11), k = 6, ..., 15. This is the
probability distribution of the sample median. The probabilities are

k    6      7      8      9      10     11     12     13     14     15
Pr  .01192 .04598 .09902 .15404 .18905 .18905 .15404 .09902 .04598 .01192
Solution 2.27 Without loss of generality, assume that the stick is of length 1. Let 𝑥, 𝑦
and (1 − 𝑥 − 𝑦), 0 < 𝑥, 𝑦 < 1, be the length of the 3 pieces. Obviously, 0 < 𝑥 + 𝑦 < 1.
All points in S = {(𝑥, 𝑦) : 𝑥, 𝑦 > 0, 𝑥 + 𝑦 < 1} are uniformly distributed. In order
that the three pieces can form a triangle, the following three conditions should be
satisfied:
   (i) 𝑥 + 𝑦 > (1 − 𝑥 − 𝑦)
   (ii) 𝑥 + (1 − 𝑥 − 𝑦) > 𝑦
   (iii) 𝑦 + (1 − 𝑥 − 𝑦) > 𝑥.
   The set of points (𝑥, 𝑦) satisfying (i), (ii) and (iii) is bounded by a triangle of area
1/8. S is bounded by a triangle of area 1/2. Hence, the required probability is 1/4.
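A quick Monte Carlo check of the 1/4 result (our addition):

import random

random.seed(1)
trials = 100_000
count = 0
for _ in range(trials):
    # two uniform break points on a stick of length 1
    u, v = sorted((random.random(), random.random()))
    a, b, c = u, v - u, 1 - v
    # the three pieces form a triangle iff each piece is shorter than 1/2
    if a < 0.5 and b < 0.5 and c < 0.5:
        count += 1
print(count / trials)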
Solution 2.28 Consider Fig. 2.1. Suppose that the particle is moving along the
circumference of the circle in a counterclockwise direction. Then, using the notation
in the diagram, Pr{hit} = φ/2π. Since OD = 1 = OE, OB = √(a² + h²) = OC and
the lines DB and EC are tangential to the circle, it follows that the triangles ΔODB
and ΔOEC are congruent. Thus m(∠DOB) = m(∠EOC), and it is easily seen that
φ = 2α. Now α = tan⁻¹(h/a), and hence Pr{hit} = (1/π) tan⁻¹(h/a).
(Fig. 2.1: the circle with center O, tangent lines DB and EC, and angle α.)
Solution 2.29 1 − (0.999)^100 − 100 × (0.001) × (0.999)^99 − C(100,2) × (0.001)² × (0.999)^98 =
0.0001504.
Solution 2.30 The probability that n tosses are required is p(n) = (n − 1)(1/2)^n, n ≥ 2.
Thus, p(4) = 3 · (1/2)^4 = 3/16.
Solution 2.31 S = {(i_1, ..., i_10) : i_j = 0, 1, j = 1, ..., 10}. One random variable
is the number of 1's in an element, i.e., for ω = (i_1, ..., i_10), X_1(ω) = Σ_{j=1}^{10} i_j.
Another random variable is the number of zeros to the left of the first 1, i.e.,
X_2(ω) = Σ_{j=1}^{10} ∏_{k=1}^{j} (1 − i_k). Notice that X_2(ω) = 0 if i_1 = 1 and X_2(ω) = 10 if
i_1 = i_2 = · · · = i_10 = 0. The probability distribution of X_1 is Pr{X_1 = k} = C(10,k)/2^10,
k = 0, 1, ..., 10. The probability distribution of X_2 is

Pr{X_2 = k} = (1/2)^{k+1},   k = 0, ..., 9,
Pr{X_2 = 10} = (1/2)^10.
Solution 2.32 (i) Since Σ_{x=0}^∞ 5^x/x! = e^5, we have Σ_{x=0}^∞ p(x) = 1; (ii) Pr{X ≤ 1} =
e^{−5}(1 + 5) = 0.0404; (iii) Pr{X ≤ 7} = 0.8666.
                          Solution 2.33 (i) Pr{𝑋 = −1} = 0.3; (ii) Pr{−0.5 < 𝑋 < 0} = 0.1; (iii) Pr{0 ≤ 𝑋 <
                          0.75} = 0.425; (iv) Pr{𝑋 = 1} = 0; (v) 𝐸 {𝑋 } = −0.25, 𝑉 {𝑋 } = 0.4042.
Solution 2.34

E{X} = ∫_0^∞ (1 − F(x)) dx = ∫_0^∞ e^{−x²/2σ²} dx = σ√(π/2).
Solution 2.35

E{X} = (1/N) Σ_{i=1}^N i = (N + 1)/2;    E{X²} = (1/N) Σ_{i=1}^N i² = (N + 1)(2N + 1)/6.

Hence V{X} = E{X²} − (E{X})² = (N² − 1)/12.

   When j = l the term is (−1)^l μ_1^l. When j = l − 1 the term is (−1)^{l−1} l μ_1^{l−1} μ_1 =
(−1)^{l−1} l μ_1^l. Thus, the sum of the last 2 terms is (−1)^{l−1}(l − 1)μ_1^l and we have

μ_l* = Σ_{j=0}^{l−2} (−1)^j C(l,j) μ_1^j μ_{l−j} + (−1)^{l−1}(l − 1)μ_1^l.
Solution 2.39 We saw in the solution of Exercise 2.33 that μ_1 = −0.25. Moreover,
μ_2 = V{X} + μ_1² = 0.4667.
Solution 2.40 M_X(t) = (e^{tb} − e^{ta}) / (t(b − a)), −∞ < t < ∞, a < b.

E{X} = (a + b)/2;    V{X} = (b − a)²/12.
Solution 2.41 (i) For t < λ we have M(t) = (1 − t/λ)^{−1},

M'(t) = (1/λ)(1 − t/λ)^{−2},       μ_1 = M'(0) = 1/λ
M''(t) = (2/λ²)(1 − t/λ)^{−3},     μ_2 = M''(0) = 2/λ²
M^(3)(t) = (6/λ³)(1 − t/λ)^{−4},   μ_3 = M^(3)(0) = 6/λ³
M^(4)(t) = (24/λ⁴)(1 − t/λ)^{−5},  μ_4 = M^(4)(0) = 24/λ⁴.

(ii) The central moments are

μ_1* = 0,
μ_2* = 1/λ²,
μ_3* = μ_3 − 3μ_2 μ_1 + 2μ_1³ = 6/λ³ − 6/λ³ + 2/λ³ = 2/λ³,
μ_4* = μ_4 − 4μ_3 μ_1 + 6μ_2 μ_1² − 3μ_1⁴ = (1/λ⁴)(24 − 4·6 + 6·2 − 3) = 9/λ⁴.

(iii) The index of kurtosis is β_4 = μ_4*/(μ_2*)² = 9.
x = list(range(15))
table = pd.DataFrame({
  'x': x,
  'p.d.f.': [stats.binom(20, 0.17).pmf(x) for x in x],
  'c.d.f.': [stats.binom(20, 0.17).cdf(x) for x in x],
})
print(table)
        x         p.d.f.      c.d.f.
0       0   2.407475e-02    0.024075
1       1   9.861947e-02    0.122694
2       2   1.918921e-01    0.314586
3       3   2.358192e-01    0.550406
4       4   2.052764e-01    0.755682
5       5   1.345426e-01    0.890224
6       6   6.889229e-02    0.959117
7       7   2.822094e-02    0.987338
8       8   9.392812e-03    0.996731
9       9   2.565105e-03    0.999296
10     10   5.779213e-04    0.999874
11     11   1.076086e-04    0.999981
12     12   1.653023e-05    0.999998
13     13   2.083514e-06    1.000000
14     14   2.133719e-07    1.000000
Solution 2.45 Pr{no defective chip on the board} = p^50. Solving p^50 = 0.99 yields
p = (0.99)^{1/50} = 0.999799.
Solution 2.46 Notice first that lim_{n→∞, np→λ} b(0; n, p) = lim_{n→∞} (1 − λ/n)^n = e^{−λ}.
Moreover, for all j = 0, 1, ..., n − 1,

b(j + 1; n, p)/b(j; n, p) = ((n − j)/(j + 1)) · (p/(1 − p)).

Thus, by induction on j, for j > 0,

lim_{n→∞, np→λ} b(j; n, p) = lim_{n→∞, np→λ} b(j − 1; n, p) · ((n − j + 1)/j) · (p/(1 − p))
                           = e^{−λ} (λ^{j−1}/(j − 1)!) · lim_{n→∞, np→λ} ((n − j + 1)p)/(j(1 − p))
                           = e^{−λ} (λ^{j−1}/(j − 1)!) · (λ/j)
                           = e^{−λ} λ^j/j!.
Solution 2.50

Pr{R} = 1 − H(3; 100, 10, 20) + Σ_{i=1}^{3} h(i; 100, 10, 20)[1 − H(3 − i; 80, 10 − i, 40)]
      = 0.87395.
Solution 2.51 The m.g.f. of the Poisson distribution with parameter λ, P(λ), is
M(t) = exp(λ(e^t − 1)). The moments and central moments are

μ_1 = λ                        μ_1* = 0
μ_2 = λ² + λ                   μ_2* = λ
μ_3 = λ³ + 3λ² + λ             μ_3* = λ
μ_4 = λ⁴ + 6λ³ + 7λ² + λ       μ_4* = 3λ² + λ.

   Thus, the indices of skewness and kurtosis are β_3 = λ^{−1/2} and β_4 = 3 + 1/λ.
   For 𝜆 = 10 we have 𝛽3 = 0.3162 and 𝛽4 = 3.1.
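This can be checked with scipy (our addition; note that scipy reports excess kurtosis, i.e. β_4 − 3):

skew, kurt = stats.poisson(10).stats(moments='sk')
print(skew, kurt)   # kurt is the excess kurtosis, beta_4 - 3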
Solution 2.52 Let 𝑋 be the number of blemishes observed. Pr{𝑋 > 2} = 0.1912.
Solution 2.53 Using the Poisson approximation with 𝑁 = 8000 and 𝑝 = 380 × 10−6 ,
we have 𝜆 = 3.04 and Pr{𝑋 > 6} = 0.0356, where 𝑋 is the number of insertion
errors in 2 hours of operation.
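A quick check with scipy (our addition):

lam = 8000 * 380e-6   # lambda = 3.04
print(1 - stats.poisson(lam).cdf(6))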
Solution 2.55 Using Python we obtain that for the 𝑁 𝐵( 𝑝, 𝑘) with 𝑝 = 0.01 and
𝑘 = 3, 𝑄 1 = 170, 𝑀𝑒 = 265, and 𝑄 3 = 389.
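These quantiles were presumably obtained with scipy's negative binomial; a sketch (note that scipy's nbinom counts failures before the k-th success, so depending on the convention used in the text the values may be shifted by k):

rv = stats.nbinom(3, 0.01)
print(rv.ppf([0.25, 0.5, 0.75]))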
μ_1* = 0,
μ_2* = (1 − p)/p²,
μ_3* = (1/p³)(1 − p)(2 − p),
μ_4* = (1/p⁴)(9 − 18p + 10p² − p³).

   Thus the indices of skewness and kurtosis are β_3* = (2 − p)/√(1 − p) and
β_4* = (9 − 9p + p²)/(1 − p).
Solution 2.58 If there are 𝑛 chips, 𝑛 > 50, the probability of at least 50 good ones is
1 − 𝐵(49; 𝑛, 0.998). Thus, 𝑛 is the smallest integer > 50 for which 𝐵(49; 𝑛, 0.998) <
0.05. It is sufficient to order 51 chips.
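A check with scipy (our addition):

for n in range(50, 55):
    # probability of at least 50 good chips among n ordered
    print(n, 1 - stats.binom(n, 0.998).cdf(49))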
Solution 2.59 If X has a geometric distribution then, for every j, j = 1, 2, ...,
Pr{X > j} = (1 − p)^j. Thus,

Pr{X > n + m | X > m} = Pr{X > n + m} / Pr{X > m}
                      = (1 − p)^{n+m} / (1 − p)^m
                      = (1 − p)^n
                      = Pr{X > n}.
Solution 2.60 For 0 < y < 1, Pr{F(X) ≤ y} = Pr{X ≤ F^{−1}(y)} = F(F^{−1}(y)) = y.
Hence, the distribution of F(X) is uniform on (0, 1). Conversely, if U has a uniform
distribution on (0, 1), then Pr{F^{−1}(U) ≤ x} = Pr{U ≤ F(x)} = F(x), i.e., F^{−1}(U)
has distribution F.
Solution 2.61 E{U(10, 50)} = 30; V{U(10, 50)} = 1600/12 = 133.33;
σ{U(10, 50)} = 40/(2√3) = 11.547.
Solution 2.62 Let X = −log(U) where U has a uniform distribution on (0,1). Then
Pr{X ≤ x} = Pr{U ≥ e^{−x}} = 1 − e^{−x}, x > 0, so X ∼ E(1).
Solution 2.63 (i) Pr{92 < 𝑋 < 108} = 0.4062; (ii) Pr{𝑋 > 105} = 0.3694;
   (iii) Pr{2𝑋 + 5 < 200} = Pr{𝑋 < 97.5} = 0.4338.
rv = stats.norm(100, 15)
print('(i)', rv.cdf(108) - rv.cdf(92))
print('(ii)', 1 - rv.cdf(105))
print('(iii)', rv.cdf((200 - 5)/2))
(i) 0.4061971427922976
(ii) 0.36944134018176367
(iii) 0.43381616738909634
Solution 2.64 Let 𝑧 𝛼 denote the 𝛼 quantile of a 𝑁 (0, 1) distribution. Then the two
equations 𝜇 + 𝑧 .9 𝜎 = 15 and 𝜇 + 𝑧 .99 𝜎 = 20 yield the solution 𝜇 = 8.8670 and
𝜎 = 4.7856.
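A short computation of μ and σ (our addition):

z90, z99 = stats.norm.ppf(0.9), stats.norm.ppf(0.99)
sigma = (20 - 15) / (z99 - z90)
mu = 15 - z90 * sigma
print(mu, sigma)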
Solution 2.65 Due to symmetry, Pr{𝑌 > 0} = Pr{𝑌 < 0} = Pr{𝐸 < 𝑣}, where
𝐸 ∼ 𝑁 (0, 1). If the probability of a bit error is 𝛼 = 0.01, then Pr{𝐸 < 𝑣} = Φ(𝑣) =
1 − 𝛼 = 0.99.
   Thus 𝑣 = 𝑧 .99 = 2.3263.
Solution 2.66 Let 𝑋 𝑝 denote the diameter of an aluminum pin and 𝑋ℎ denote the size
of a hole drilled in an aluminum plate. If 𝑋 𝑝 ∼ 𝑁 (10, 0.02) and 𝑋ℎ ∼ 𝑁 (𝜇 𝑑 , 0.02)
then the probability that the pin will not enter the hole is Pr{X_h − X_p < 0}. Now
X_h − X_p ∼ N(μ_d − 10, √(0.02² + 0.02²)), and for Pr{X_h − X_p < 0} = 0.01 we
obtain μ_d = 10.0658 mm. (The fact that the sum of two independent normal random
variables is normally distributed should be given to the student since it has not yet
been covered in the text.)
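The computation of μ_d (our addition):

import math

sigma_diff = math.sqrt(0.02**2 + 0.02**2)
mu_d = 10 + stats.norm.ppf(0.99) * sigma_diff
print(mu_d)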
Solution 2.67 For X_1, ..., X_n i.i.d. N(μ, σ²), Y = Σ_{i=1}^n i X_i ∼ N(μ_Y, σ_Y²) where

μ_Y = μ Σ_{i=1}^n i = μ n(n + 1)/2   and   σ_Y² = σ² Σ_{i=1}^n i² = σ² n(n + 1)(2n + 1)/6.
Solution 2.68 Pr{𝑋 > 300} = Pr{log 𝑋 > 5.7038} = 1 − Φ(0.7038) = 0.24078.
Solution 2.69 For X ∼ e^{N(μ,σ)}, X ∼ e^Y where Y ∼ N(μ, σ), M_Y(t) = e^{μt + σ²t²/2}.

ξ = E{X} = E{e^Y} = M_Y(1) = e^{μ + σ²/2}.

Since E{X²} = E{e^{2Y}} = M_Y(2) = e^{2μ + 2σ²}, we have

V{X} = e^{2μ + 2σ²} − e^{2μ + σ²} = e^{2μ + σ²}(e^{σ²} − 1) = ξ²(e^{σ²} − 1).
Solution 2.74

G(t; k, λ) = (λ^k/(k − 1)!) ∫_0^t x^{k−1} e^{−λx} dx
           = (λ^k/k!) t^k e^{−λt} + (λ^{k+1}/k!) ∫_0^t x^k e^{−λx} dx
           = (λ^k/k!) t^k e^{−λt} + (λ^{k+1}/(k + 1)!) t^{k+1} e^{−λt} + (λ^{k+2}/(k + 1)!) ∫_0^t x^{k+1} e^{−λx} dx
           = · · ·
           = e^{−λt} Σ_{j=k}^∞ (λt)^j/j!
           = 1 − e^{−λt} Σ_{j=0}^{k−1} (λt)^j/j!.
                                                        
Solution 2.75 Γ(1.17) = 0.9267, Γ(1/2) = √π = 1.77245, Γ(3/2) = (1/2)Γ(1/2) = 0.88623.
from scipy.special import gamma
print(gamma(1.17), gamma(1 / 2), gamma(3 / 2))
Solution 2.76 The moment generating function of the sum of independent random
variables is the product of their respective m.g.f.'s. Thus, if X_1, ..., X_k are i.i.d.
E(β), using the result of Exercise 2.73, M_S(t) = ∏_{i=1}^k (1 − βt)^{−1} = (1 − βt)^{−k},
t < 1/β, where S = Σ_{i=1}^k X_i. On the other hand, (1 − βt)^{−k} is the m.g.f. of G(k, β).
Solution 2.77 The expected value and variance of W(2, 3.5) are

E{W(2, 3.5)} = 3.5 × Γ(1 + 1/2) = 3.1018,
V{W(2, 3.5)} = (3.5)² [Γ(1 + 2/2) − Γ²(1 + 1/2)] = 2.6289.
Solution 2.78 Let T be the number of days until failure. T ∼ W(1.5, 500) ∼ 500·W(1.5, 1).

Pr{T ≥ 600} = Pr{W(1.5, 1) ≥ 6/5} = e^{−(6/5)^{1.5}} = 0.2686.
                               
Solution 2.79 Let X ∼ Beta(1/2, 3/2). Then

E{X} = (1/2)/(1/2 + 3/2) = 1/4,    V{X} = ((1/2)(3/2))/(2² · 3) = 1/16,    σ{X} = 1/4.
Solution 2.80 Let X ∼ Beta(ν, ν). The first four moments are

μ_1 = ν/2ν = 1/2
μ_2 = B(ν + 2, ν)/B(ν, ν) = (ν + 1)/(2(2ν + 1))
μ_3 = B(ν + 3, ν)/B(ν, ν) = (ν + 1)(ν + 2)/(2(2ν + 1)(2ν + 2))
μ_4 = B(ν + 4, ν)/B(ν, ν) = (ν + 1)(ν + 2)(ν + 3)/(2(2ν + 1)(2ν + 2)(2ν + 3)).

   The variance is σ² = 1/(4(2ν + 1)) and the fourth central moment is

μ_4* = μ_4 − 4μ_3 μ_1 + 6μ_2 μ_1² − 3μ_1⁴ = 3/(16(3 + 8ν + 4ν²)).

   Finally, the index of kurtosis is β_2 = μ_4*/σ⁴ = 3(1 + 2ν)/(3 + 2ν).
Solution 2.81 Let (X, Y) have a joint p.d.f.

f(x, y) = 1/2 if (x, y) ∈ S, and 0 otherwise.

   (i) The marginal distributions of X and Y have p.d.f.'s

f_X(x) = (1/2) ∫_{−1+|x|}^{1−|x|} dy = 1 − |x|, −1 < x < 1, and by symmetry,
f_Y(y) = 1 − |y|, −1 < y < 1.

   (ii) E{X} = E{Y} = 0, V{X} = V{Y} = 2 ∫_0^1 y²(1 − y) dy = 2B(3, 2) = 1/6.
Solution 2.82 The marginal p.d.f. of Y is f(y) = e^{−y}, y > 0, that is, Y ∼ E(1). The
conditional p.d.f. of X, given Y = y, is f(x | y) = (1/y) e^{−x/y}, which is the p.d.f. of
an exponential with parameter y. Thus, E{X | Y = y} = y, and E{X} = E{E{X | Y}} = E{Y} = 1.
Also,

E{XY} = E{Y E{X | Y}} = E{Y²} = 2.

σ_X² = E{V{X | Y}} + V{E{X | Y}} = E{Y²} + V{Y} = 2 + 1 = 3.

   Hence cov(X, Y) = E{XY} − E{X}E{Y} = 1 and the correlation between X and Y is ρ_XY = 1/√3.
                                                                  
                                                                  
Solution 2.83 Let (X, Y) have joint p.d.f. f(x, y) = 2 if (x, y) ∈ T, and 0 otherwise.

   The marginal densities of X and Y are

f_X(x) = 2(1 − x), 0 ≤ x ≤ 1
f_Y(y) = 2(1 − y), 0 ≤ y ≤ 1.

   Notice that f(x, y) ≠ f_X(x) f_Y(y) for x = 1/2, y = 1/4. Thus, X and Y are dependent.

cov(X, Y) = E{XY} − E{X}E{Y} = E{XY} − 1/9.

E{XY} = 2 ∫_0^1 x ∫_0^{1−x} y dy dx = ∫_0^1 x(1 − x)² dx = B(2, 3) = 1/12.

   Hence, cov(X, Y) = 1/12 − 1/9 = −1/36.
Solution 2.84 J | N ∼ B(N, p); N ∼ P(λ). E{N} = λ, V{N} = λ, E{J} = λp.

V{J} = E{V{J | N}} + V{E{J | N}} = E{N p(1 − p)} + V{N p} = λp(1 − p) + p²λ = λp.
E{JN} = E{N E{J | N}} = p E{N²} = p(λ + λ²).

   Hence, cov(J, N) = pλ(1 + λ) − pλ² = pλ and ρ_JN = pλ/(√(λp) · √λ) = √p.
Solution 2.85 Let X ∼ G(2, 100) ∼ 100·G(2, 1) and Y ∼ W(1.5, 500) ∼ 500·W(1.5, 1).
Then XY ∼ 5 × 10⁴ G(2, 1) · W(1.5, 1) and V{XY} = 25 × 10⁸ · V{GW}, where
G ∼ G(2, 1) and W ∼ W(3/2, 1).

V{GW} = E{G²}V{W} + E²{W}V{G}
      = 6[Γ(1 + 4/3) − Γ²(1 + 2/3)] + 2·Γ²(1 + 2/3)
      = 3.88404.

   Thus V{XY} = 9.7101 × 10⁹.
Solution 2.87 Using the notation of Example 2.34, the joint p.d.f. of J_1 and J_2 is

p(j_1, j_2) = C(20, j_1) C(50, j_2) C(30, 20 − j_1 − j_2) / C(100, 20),   0 ≤ j_1, j_2; j_1 + j_2 ≤ 20.

   The marginal distribution of J_1 is H(100, 20, 20). The marginal distribution of J_2
is H(100, 50, 20). Accordingly,

V{J_1} = 20 × 0.2 × 0.8 × (1 − 19/99) = 2.585859,
V{J_2} = 20 × 0.5 × 0.5 × (1 − 19/99) = 4.040404.

   The conditional distribution of J_1, given J_2, is H(50, 20, 20 − J_2). Hence,

E{J_1 J_2} = E{E{J_1 J_2 | J_2}}
           = E{J_2 (20 − J_2) × 2/5}
           = 8E{J_2} − 0.4E{J_2²}
           = 80 − 0.4 × 104.040404 = 38.38381

and cov(J_1, J_2) = −1.61616.

   Finally, the correlation between J_1 and J_2 is ρ = −1.61616/√(2.585859 × 4.040404) = −0.50.
Pr{X_(1) ≤ x | J ≥ 1} = Σ_{k=1}^{10} b(k; 10, 0.95)(1 − e^{−kx/10})
                      = 1 − Σ_{k=1}^{10} C(10, k)(0.95e^{−x/10})^k (0.05)^{10−k}
                      = 1 − [0.05 + 0.95e^{−x/10}]^{10} + (0.05)^{10}
                      = 1 − (0.05 + 0.95e^{−x/10})^{10}.
= 2772 Σ_{j=0}^{5} (−1)^j C(5, j) 1/(λ(6 + j)²)
= 0.73654/λ.
Solution 2.95 Let U ∼ N(0, 1) and X ∼ N(μ, σ). We assume that U and X are
independent. Then Φ(X) = Pr{U < X | X} and therefore

E{Φ(X)} = E{Pr{U < X | X}}
        = Pr{U < X}
        = Pr{U − X < 0}
        = Φ(μ/√(1 + σ²)).

   The last equality follows from the fact that U − X ∼ N(−μ, √(1 + σ²)).
Solution 2.99 Let Y = X_1 + X_2 where X_1, X_2 are i.i.d. uniform on (0,1). Then the
p.d.f.'s are

f_1(x) = f_2(x) = I{0 < x < 1}

g(y) = ∫_0^y dx = y            if 0 ≤ y < 1
g(y) = ∫_{y−1}^1 dx = 2 − y    if 1 ≤ y ≤ 2.
Solution 2.100 X_1, X_2 are i.i.d. E(1). U = X_1 − X_2. Pr{U ≤ u} = ∫_0^∞ e^{−x} Pr{X_1 ≤
u + x} dx. Notice that −∞ < u < ∞ and Pr{X_1 ≤ u + x} = 0 if x + u < 0. Let
a⁺ = max(a, 0). Then

Pr{U ≤ u} = ∫_0^∞ e^{−x}(1 − e^{−(u+x)⁺}) dx
          = 1 − ∫_0^∞ e^{−x−(u+x)⁺} dx
          = 1 − (1/2)e^{−u}   if u ≥ 0
          = (1/2)e^{−|u|}     if u < 0.

   Thus, the p.d.f. of U is g(u) = (1/2)e^{−|u|}, −∞ < u < ∞.
Solution 2.101 T = X_1 + · · · + X_20 ∼ N(20μ, 20σ²). Pr{T ≤ 50} ≈ Φ((50 − 40)/44.7214) =
0.5885.
Solution 2.102 X ∼ B(200, 0.15). μ = np = 30, σ = √(np(1 − p)) = 5.0497.

Pr{25 < X < 35} ≈ Φ((34.5 − 30)/5.0497) − Φ((25.5 − 30)/5.0497) = 0.6271.
                                                                 
Solution 2.103 X ∼ P(200). Pr{190 < X < 210} ≈ 2Φ(9.5/√200) − 1 = 0.4983.
Solution 2.104 X ∼ Beta(3, 5). μ = E{X} = 3/8 = 0.375. σ = √(V{X}) = (3·5/(8²·9))^{1/2} =
0.161374.

Pr{|X̄_200 − 0.375| < 0.2282} ≈ 2Φ(√200 × 0.2282/0.161374) − 1 = 1.
Solution 2.105 𝑡 .95 [10] = 1.8125, 𝑡 .95 [15] = 1.7531, 𝑡 .95 [20] = 1.7247.
print(stats.t.ppf(0.95, 10))
print(stats.t.ppf(0.95, 15))
print(stats.t.ppf(0.95, 20))
1.8124611228107335
1.7530503556925547
1.7247182429207857
Solution 2.106 F_.95[10, 30] = 2.1646, F_.95[15, 30] = 2.0148, F_.95[20, 30] = 1.9317.
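These values were presumably obtained with scipy, e.g.:

print(stats.f.ppf(0.95, 10, 30))
print(stats.f.ppf(0.95, 15, 30))
print(stats.f.ppf(0.95, 20, 30))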
2.164579917125473
2.0148036912954885
1.931653475236928
Solution 2.107 The solution to this problem is based on the fact, which is not
discussed in the text, that t[ν] is distributed like the ratio of two independent random
variables, N(0, 1) and √(χ²[ν]/ν). Accordingly, t[n] ∼ N(0, 1)/√(χ²[n]/n), where N(0, 1) and
χ²[n] are independent. Hence t²[n] ∼ (N(0, 1))²/(χ²[n]/n) ∼ F[1, n]. Thus, since Pr{F[1, n] ≤
F_{1−α}[1, n]} = 1 − α,
Pr{−√(F_{1−α}[1, n]) ≤ t[n] ≤ √(F_{1−α}[1, n])} = 1 − α.

It follows that √(F_{1−α}[1, n]) = t_{1−α/2}[n], or F_{1−α}[1, n] = t²_{1−α/2}[n]. (If you assign
this problem, please inform the students of the above fact.)
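A numerical check (our addition):

# F quantile versus squared t quantile, e.g. for n = 30
print(stats.f.ppf(0.95, 1, 30), stats.t.ppf(0.975, 30)**2)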
Solution 2.108 A random variable F[ν_1, ν_2] is distributed like the ratio of two
independent random variables χ²[ν_1]/ν_1 and χ²[ν_2]/ν_2. Accordingly, F[ν_1, ν_2] ∼
(χ²[ν_1]/ν_1)/(χ²[ν_2]/ν_2) and

1 − α = Pr{F[ν_1, ν_2] ≤ F_{1−α}[ν_1, ν_2]}
      = Pr{(χ_1²[ν_1]/ν_1)/(χ_2²[ν_2]/ν_2) ≤ F_{1−α}[ν_1, ν_2]}
      = Pr{(χ_2²[ν_2]/ν_2)/(χ_1²[ν_1]/ν_1) ≥ 1/F_{1−α}[ν_1, ν_2]}
      = Pr{F[ν_2, ν_1] ≥ 1/F_{1−α}[ν_1, ν_2]}
      = Pr{F[ν_2, ν_1] ≥ F_α[ν_2, ν_1]}.

   Hence F_{1−α}[ν_1, ν_2] = 1/F_α[ν_2, ν_1].
Solution 2.109 Using the fact that $t[\nu] \sim \dfrac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}$, where $N(0,1)$ and $\chi^2[\nu]$ are independent,
$$V\{t[\nu]\} = V\left\{\frac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}\right\} = E\left\{V\left(\frac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right)\right\} + V\left\{E\left(\frac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right)\right\}.$$
By independence, $V\left(\frac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right) = \dfrac{\nu}{\chi^2[\nu]}$ and $E\left(\frac{N(0,1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right) = 0$.

Thus, $V\{t[\nu]\} = \nu E\left\{\frac{1}{\chi^2[\nu]}\right\}$. Since $\chi^2[\nu] \sim G\left(\frac{\nu}{2}, 2\right) \sim 2G\left(\frac{\nu}{2}, 1\right)$,
$$E\left\{\frac{1}{\chi^2[\nu]}\right\} = \frac{1}{2}\cdot\frac{1}{\Gamma\left(\frac{\nu}{2}\right)}\int_0^\infty x^{\frac{\nu}{2}-2}e^{-x}\,\mathrm{d}x = \frac{1}{2}\cdot\frac{\Gamma\left(\frac{\nu}{2}-1\right)}{\Gamma\left(\frac{\nu}{2}\right)} = \frac{1}{2}\cdot\frac{1}{\frac{\nu}{2}-1} = \frac{1}{\nu-2}.$$
Finally, $V\{t[\nu]\} = \dfrac{\nu}{\nu-2}$, $\nu > 2$.
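A quick numerical check (the choice $\nu = 5$ is arbitrary):

nu = 5
print(stats.t(nu).var(), nu / (nu - 2))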
Solution 2.110 $E\{F[3,10]\} = \dfrac{10}{8} = 1.25$, $\quad V\{F[3,10]\} = \dfrac{2\cdot 10^2\cdot 11}{3\cdot 8^2\cdot 6} = 1.9097$.
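These moments agree with scipy's built-in values:

print(stats.f(3, 10).stats(moments='mv'))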
Chapter 3
Statistical Inference and Bootstrapping
import random
import math
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import pingouin as pg
import mistat
import os
os.environ['OUTDATED_IGNORE'] = '1'
Solution 3.1 By the WLLN, for any 𝜖 > 0, lim𝑛→∞ Pr{|𝑀𝑙 − 𝜇𝑙 | < 𝜖 } = 1. Hence,
𝑀𝑙 is a consistent estimator of the 𝑙-th moment.
Solution 3.2 Using the CLT, $\Pr\{|\bar X_n - \mu| < 1\} \approx 2\Phi\left(\frac{\sqrt n}{\sigma}\right) - 1$. To determine the sample size $n$ so that this probability is 0.95 we set $2\Phi\left(\frac{\sqrt n}{\sigma}\right) - 1 = 0.95$ and solve for $n$. This gives $\frac{\sqrt n}{\sigma} = z_{.975} = 1.96$. Thus $n \ge \sigma^2(1.96)^2 = 424$ for $\sigma = 10.5$.
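In Python the required sample size can be obtained directly (a small sketch):

sigma = 10.5
n = np.ceil((sigma * stats.norm.ppf(0.975))**2)
print(n)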
Solution 3.3 $\hat\xi_p = \bar X_n + z_p\hat\sigma_n$, where $\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2$.
Solution 3.4 Let (𝑋1 , 𝑌1 ), . . . , (𝑋𝑛 , 𝑌𝑛 ) be a random sample from a bivariate normal
distribution with density 𝑓 (𝑥, 𝑦; 𝜇, 𝜂, 𝜎𝑋 , 𝜎𝑌 , 𝜌) as given in Eq. (4.6.6). Let 𝑍𝑖 = 𝑋𝑖𝑌𝑖
for 𝑖 = 1, . . . , 𝑛. Then the first moment of 𝑍 is given by
$$\mu_1(F_Z) = E\{Z\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x,y;\mu,\eta,\sigma_X,\sigma_Y,\rho)\,\mathrm{d}x\,\mathrm{d}y = \mu\eta + \rho\sigma_X\sigma_Y.$$
Using this fact, as well as the first 2 moments of 𝑋 and 𝑌 , we get the following
moment equations:
$$\frac{1}{n}\sum_{i=1}^n X_i = \mu \qquad\qquad \frac{1}{n}\sum_{i=1}^n Y_i = \eta$$
$$\frac{1}{n}\sum_{i=1}^n X_i^2 = \sigma_X^2 + \mu^2 \qquad\qquad \frac{1}{n}\sum_{i=1}^n Y_i^2 = \sigma_Y^2 + \eta^2$$
$$\frac{1}{n}\sum_{i=1}^n X_iY_i = \mu\eta + \rho\sigma_X\sigma_Y.$$
$$\hat\nu_1 = M_1(M_1 - M_2)/\hat\sigma_n^2 \qquad\text{and}\qquad \hat\nu_2 = (M_1 - M_2)(1 - M_1)/\hat\sigma_n^2.$$
Solution 3.6 $V\{\bar Y_w\} = \dfrac{1}{\left(\sum_{i=1}^k w_i\right)^2}\sum_{i=1}^k w_i^2\dfrac{\sigma_i^2}{n_i}$. Let $\lambda_i = \dfrac{w_i}{\sum_{j=1}^k w_j}$, $\sum_{i=1}^k \lambda_i = 1$. We find weights $\lambda_i$ which minimize $V\{\bar Y_w\} = \sum_{i=1}^k \lambda_i^2\frac{\sigma_i^2}{n_i}$, under the constraint $\sum_{i=1}^k \lambda_i = 1$. The Lagrangian is $L(\lambda_1,\ldots,\lambda_k,\rho) = \sum_{i=1}^k \lambda_i^2\frac{\sigma_i^2}{n_i} + \rho\left(\sum_{i=1}^k \lambda_i - 1\right)$. Differentiating with respect to $\lambda_i$, we get
$$\frac{\partial}{\partial\lambda_i}L(\lambda_1,\ldots,\lambda_k,\rho) = 2\lambda_i\frac{\sigma_i^2}{n_i} + \rho,\quad i = 1,\ldots,k, \qquad\text{and}\qquad \frac{\partial}{\partial\rho}L(\lambda_1,\ldots,\lambda_k,\rho) = \sum_{i=1}^k\lambda_i - 1.$$
Also
$$\begin{aligned}
\mathrm{cov}(\hat\beta_0,\hat\beta_1) &= \mathrm{cov}\left(\sum_{i=1}^n\left(\frac{1}{n} - w_i\bar x_n\right)Y_i,\ \sum_{i=1}^n w_iY_i\right)\\
&= \sigma^2\sum_{i=1}^n\left(\frac{1}{n} - w_i\bar x_n\right)w_i\\
&= -\sigma^2\frac{\bar x_n}{SS_x}.
\end{aligned}$$
$$\rho_{\hat\beta_0,\hat\beta_1} = -\frac{\sigma^2\bar x_n/SS_x}{\left(\sigma^2 \left(\frac{1}{n} + \frac{\bar x_n^2}{SS_x}\right)\cdot\frac{\sigma^2}{SS_x}\right)^{1/2}} = -\frac{\bar x_n}{\left(\frac{1}{n}\sum_{i=1}^n x_i^2\right)^{1/2}}.$$
$$M_X(t) = \frac{p^2}{(1-(1-p)e^t)^2},\qquad t < -\log(1-p).$$
Let $X_1, X_2, \ldots, X_n$ be i.i.d. $NB(2,p)$, then the m.g.f. of $T_n = \sum_{i=1}^n X_i$ is
$$M_{T_n}(t) = \frac{p^{2n}}{(1-(1-p)e^t)^{2n}},\qquad t < -\log(1-p).$$
Thus, $T_n \sim NB(2n, p)$. According to Example 3.4, the MLE of $p$, based on $T_n$ (which is a sufficient statistic) is $\hat p_n = \dfrac{2n}{T_n + 2n} = \dfrac{2}{\bar X_n + 2}$, where $\bar X_n = T_n/n$ is the sample mean.
(i) According to the WLLN, $\bar X_n$ converges in probability to $E\{X_1\} = \dfrac{2(1-p)}{p}$. Substituting $\dfrac{2(1-p)}{p}$ for $\bar X_n$ in the formula of $\hat p_n$ we obtain $p^* = \dfrac{2}{2 + 2\frac{1-p}{p}} = p$. This shows that the limit in probability, as $n\to\infty$, of $\hat p_n$ is $p$.
(ii) Substituting $k = 2n$ in the formulas of Example 3.4 we obtain
$$\mathrm{Bias}(\hat p_n) \approx \frac{3p(1-p)}{4n}\qquad\text{and}\qquad V\{\hat p_n\} \approx \frac{p^2(1-p)}{2n}.$$
Solution 3.13 The likelihood function of $\mu$ and $\beta$ is
$$L(\mu,\beta) = I\{X_{(1)} \ge \mu\}\,\frac{1}{\beta^n}\exp\left\{-\frac{1}{\beta}\sum_{i=1}^n (X_{(i)} - X_{(1)}) - \frac{n}{\beta}(X_{(1)} - \mu)\right\},$$
where $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics.
(ii) Furthermore, $L^*(\beta)$ is maximized by $\hat\beta_n = \frac{1}{n}\sum_{i=2}^n (X_{(i)} - X_{(1)})$. The MLEs are $\hat\mu = X_{(1)}$ and $\hat\beta_n = \frac{1}{n}\sum_{i=2}^n (X_{(i)} - X_{(1)})$.
(iii) $X_{(1)}$ is distributed like $\mu + E\left(\frac{\beta}{n}\right)$, with p.d.f.
$$f_{(1)}(x;\mu,\beta) = I\{x \ge \mu\}\,\frac{n}{\beta}e^{-\frac{n}{\beta}(x-\mu)}.$$
Thus, the joint p.d.f. of $(X_1,\ldots,X_n)$ is factored to a product of the p.d.f. of $X_{(1)}$ and a function of $\hat\beta_n$, which does not depend on $X_{(1)}$ (nor on $\mu$). This implies that $X_{(1)}$ and $\hat\beta_n$ are independent. $V\{\hat\mu\} = V\{X_{(1)}\} = \dfrac{\beta^2}{n^2}$. It can be shown that $\hat\beta_n \sim \frac{1}{n}G(n-1,\beta)$. Accordingly, $V\{\hat\beta_n\} = \dfrac{n-1}{n^2}\beta^2 = \dfrac{1}{n}\left(1 - \dfrac{1}{n}\right)\beta^2$.
Solution 3.14 In sampling with replacement, the number of defective items in the
sample, 𝑋, has the binomial distribution 𝐵(𝑛, 𝑝). We test the hypotheses 𝐻0 : 𝑝 ≤
0.03 against 𝐻1 : 𝑝 > 0.03. 𝐻0 is rejected if 𝑋 > 𝐵 −1 (1 − 𝛼, 20, 0.03). For 𝛼 = 0.05
the rejection criterion is 𝑘 𝛼 = 𝐵 −1 (0.95, 20, 0.03) = 2. Since the number of defective
items in the sample is 𝑋 = 2, 𝐻0 is not rejected at the 𝛼 = 0.05 significance level.
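The critical value quoted above can be reproduced with scipy:

print(stats.binom(20, 0.03).ppf(0.95))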
Fig. 3.1 The OC function B(2; 30, p)
Solution 3.15 The OC function is $OC(p) = B(2; 30, p)$, $0 < p < 1$. A plot of this OC function is given in Fig. 3.1.
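The curve in Fig. 3.1 can be reproduced with a short sketch (the grid of p values is arbitrary):

p = np.linspace(0, 0.5, 200)
oc = stats.binom(30, p).cdf(2)
plt.plot(p, oc, color='black')
plt.xlabel('p')
plt.ylabel('OC(p)')
plt.show()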
Solution 3.16 The sample size $n$ must satisfy
$$1 - \Phi\left(\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}}\right) = \beta$$
or, equivalently,
$$\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}} = z_{1-\beta}.$$
This gives
$$n \approx \frac{\left(z_{1-\alpha}\sqrt{p_0 q_0} + z_{1-\beta}\sqrt{p_1 q_1}\right)^2}{(p_1 - p_0)^2} = \frac{(1.645)^2\left(\sqrt{0.01 \times 0.99} + \sqrt{0.03 \times 0.97}\right)^2}{(0.02)^2} = 494.$$
For this sample size, the critical value is $k_\alpha = np_0 + z_{1-\alpha}\sqrt{np_0 q_0} = 8.58$. Thus, $H_0$ is rejected if there are more than 8 "successes" in a sample of size 494.
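A sketch of this computation in Python ($p_0$, $p_1$ and the error levels are taken from the solution above):

p0, p1 = 0.01, 0.03
q0, q1 = 1 - p0, 1 - p1
z = stats.norm.ppf(0.95)   # alpha = beta = 0.05
n = np.ceil((z * np.sqrt(p0 * q0) + z * np.sqrt(p1 * q1))**2 / (p1 - p0)**2)
k_alpha = n * p0 + z * np.sqrt(n * p0 * q0)
print(n, k_alpha)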
Solution 3.17 $\bar X_n \sim N\left(\mu, \frac{\sigma}{\sqrt n}\right)$.
(i) If $\mu = \mu_0$ the probability that $\bar X_n$ will be outside the control limits is
$$\Pr\left\{\bar X_n < \mu_0 - \frac{3\sigma}{\sqrt n}\right\} + \Pr\left\{\bar X_n > \mu_0 + \frac{3\sigma}{\sqrt n}\right\} = \Phi(-3) + 1 - \Phi(3) = 0.0027.$$
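A one-line check of this probability:

print(2 * stats.norm.cdf(-3))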
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
pvalue 0.35
socell = mistat.load_data('SOCELL')
t2 = socell['t2']
pvalue 0.03
Solution 3.20 Let $n = 30$, $\alpha = 0.01$. The $OC(\delta)$ function for a one-sided $t$-test is
$$OC(\delta) = 1 - \Phi\left(\frac{\delta\sqrt{30} - 2.462\left(1 - \frac{1}{232}\right)}{\left(1 + \frac{6.0614}{58}\right)^{1/2}}\right) = 1 - \Phi(5.2117\delta - 2.3325).$$
delta = np.arange(0, 1.1, 0.1)
a = np.sqrt(30)
b = 2.462 * (1 - 1/232)
f = np.sqrt(1 + 6.0614 / 58)
OC_delta = 1 - stats.norm.cdf((a * delta - b) / f)
print(pd.DataFrame({'delta': delta, 'OC(delta)': OC_delta}))
                                     𝛿         OC(𝛿)
                                    0.0       0.990164
                                    0.1       0.964958
                                    0.2       0.901509
                                    0.3       0.779063
                                    0.4       0.597882
                                    0.5       0.392312
                                    0.6       0.213462
                                    0.7       0.094149
                                    0.8       0.033120
                                    0.9       0.009188
                                    1.0       0.001994
Solution 3.21 Let $n = 31$, $\alpha = 0.10$. The OC function for testing $H_0: \sigma^2 \le \sigma_0^2$ against $H_1: \sigma^2 > \sigma_0^2$ is
$$\begin{aligned}
OC(\sigma^2) &= \Pr\left\{S^2 \le \frac{\sigma_0^2}{n-1}\chi^2_{0.9}[n-1]\right\}\\
&= \Pr\left\{\chi^2[30] \le \frac{\sigma_0^2}{\sigma^2}\chi^2_{0.9}[30]\right\}\\
&= 1 - P\left(\frac{30}{2} - 1;\ \frac{\sigma_0^2}{\sigma^2}\cdot\frac{\chi^2_{0.9}[30]}{2}\right).
\end{aligned}$$
The values of $OC(\sigma^2)$ for $\sigma^2 = 1.0, 1.1, \ldots, 2.0$ are given in the following table (here $\sigma_0^2 = 1$):
                                    𝜎2        OC(𝜎 2 )
                                    1.0       0.900000
                                    1.1       0.810804
                                    1.2       0.700684
                                    1.3       0.582928
                                    1.4       0.469471
                                    1.5       0.368201
                                    1.6       0.282781
                                    1.7       0.213695
                                    1.8       0.159540
                                    1.9       0.118063
                                    2.0       0.086834
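A short sketch to reproduce the table above (assuming $\sigma_0^2 = 1$ and $n = 31$):

n = 31
sigma2 = np.arange(1.0, 2.05, 0.1)
chi2_09 = stats.chi2(n - 1).ppf(0.9)
oc = stats.chi2(n - 1).cdf(chi2_09 / sigma2)
print(pd.DataFrame({'sigma^2': sigma2, 'OC': oc}))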
Fig. 3.2 Comparison of exact and normal approximation of OC(p)
$$OC(p) = 1 - \Phi\left(\frac{p - p_0}{\sqrt{pq}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{pq}}\right),$$
for $p \ge p_0$. In Fig. 3.2, we present the graph of $OC(p)$, for $p_0 = 0.1$, $n = 100$, $\alpha = 0.05$, both using the exact solution and the normal approximation.
n = 100
p0 = 0.1
alpha = 0.05
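Continuing from the definitions above, a sketch of how the two curves in Fig. 3.2 could be computed (the grid of p values and the use of the binomial acceptance number are assumptions):

p = np.linspace(p0, 0.35, 100)
q0, q = 1 - p0, 1 - p
k_alpha = stats.binom(n, p0).ppf(1 - alpha)
OC_exact = stats.binom(n, p).cdf(k_alpha)
z = stats.norm.ppf(1 - alpha)
OC_approx = 1 - stats.norm.cdf((p - p0) / np.sqrt(p * q) * np.sqrt(n) - z * np.sqrt(p0 * q0 / (p * q)))
plt.plot(p, OC_exact, 'k-', label='OC(p)')
plt.plot(p, OC_approx, 'k--', label='OCapprox(p)')
plt.xlabel('p')
plt.legend()
plt.show()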
df = len(data) - 1   # `data` holds the sample of n = 20 observations (not shown)
mean = np.mean(data)
sem = stats.sem(data)
print(stats.t.interval(0.99, df, loc=mean, scale=sem))
(20.136889216656858, 21.38411078334315)
Confidence Intervals
Variable   N    Mean     StDev   SE Mean   99.0% C.I.
Sample     20   20.760   0.975   0.218     (20.137, 21.384)
     (ii) A 99% C.I. for 𝜎 2 is (0.468, 2.638).
0.46795850248657883
2.6380728125212016
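A sketch of this computation, assuming it uses the chi-square pivot with n = 20 and S = 0.975 (the printed values above use the unrounded sample variance):

n, S = 20, 0.975
lower = (n - 1) * S**2 / stats.chi2(n - 1).ppf(0.995)
upper = (n - 1) * S**2 / stats.chi2(n - 1).ppf(0.005)
print(lower, upper)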
Solution 3.26 Let $(\underline\mu_{.99}, \bar\mu_{.99})$ be a confidence interval for $\mu$, at level 0.99. Let $(\underline\sigma_{.99}, \bar\sigma_{.99})$ be a confidence interval for $\sigma$ at level 0.99. Let $\underline\xi = \underline\mu_{.99} + 2\underline\sigma_{.99}$ and $\bar\xi = \bar\mu_{.99} + 2\bar\sigma_{.99}$. Then
$$\Pr\{\underline\xi \le \mu + 2\sigma \le \bar\xi\} \ge \Pr\{\underline\mu_{.99} \le \mu \le \bar\mu_{.99},\ \underline\sigma_{.99} \le \sigma \le \bar\sigma_{.99}\} \ge 0.98.$$
Solution 3.27 Let 𝑋 ∼ 𝐵(𝑛, 𝜃). For 𝑋 = 17 and 𝑛 = 20, a confidence interval for 𝜃,
at level 0.95, is (0.6211, 0.9679).
alpha = 1 - 0.95
X = 17
n = 20
F1 = stats.f(2*(n-X+1), 2*X).ppf(1 - alpha/2)
F2 = stats.f(2*(X+1), 2*(n-X)).ppf(1 - alpha/2)
pL = X / (X + (n-X+1) * F1)
pU = (X+1) * F2 / (n-X + (X+1) * F2)
print(pL, pU)
0.6210731734546859 0.9679290628145363
Solution 3.28 From the data we have 𝑛 = 10 and 𝑇10 = 134. For 𝛼 = 0.05, 𝜆 𝐿 =
11.319 and 𝜆𝑈 = 15.871.
alpha = 0.05
T_n = 134   # total of the n = 10 observations (from the data)
n = 10
# exact solution
print(stats.chi2(2 * T_n + 2).ppf(alpha/2) / (2 * n))
print(stats.chi2(2 * T_n + 2).ppf(1 - alpha/2) / (2 * n))
# approximate solution
nu = 2 * T_n + 2
print((nu + stats.norm.ppf(alpha/2) * np.sqrt(2 * nu)) / (2 * n))
print((nu + stats.norm.ppf(1 - alpha/2) * np.sqrt(2 * nu)) / (2 * n))
11.318870163746238
15.870459268116013
11.222727638613012
15.777272361386988
Solution 3.29 For 𝑛 = 20, 𝜎 = 5, 𝑌¯20 = 13.75, 𝛼 = 0.05 and 𝛽 = 0.1, the tolerance
interval is (3.33, 24.17).
yarnstrg = mistat.load_data('YARNSTRG')
n = len(yarnstrg)
Ybar = yarnstrg.mean()
S = yarnstrg.std()
0.7306652047594424 5.117020795240556
Fig. 3.3 Q–Q plot of ISC-t1 (SOCELL.csv); R² = 0.911
Solution 3.33 As is shown in the normal probability plots in Fig. 3.4, the hypothesis of normality is not rejected in either case.
                          Solution 3.34 Frequency distribution for turn diameter:
[Fig. 3.4: two normal probability (Q–Q) plots; R² = 0.985 and R² = 0.986]
                  The expected frequencies were computed for 𝑁 (35.5138, 3.3208). Here 𝜒2 = 12.4,
                  d.f. = 8 and the 𝑃 value is 0.134. The differences from normal are not significant.
                                observed   expected (O-E)ˆ2/E
                      turn
                      [28, 31)        11   8.197333   0.958231
                      [31, 32)         8   6.318478   0.447500
                      [32, 33)         9   8.668695   0.012662
                      [33, 34)         6 10.869453    2.181487
                      [34, 35)        18 12.455897    2.467673
                      [35, 36)         8 13.045357    1.951317
                      [36, 37)        13 12.486789    0.021093
                      [37, 38)         6 10.923435    2.219102
                      [38, 39)         9   8.733354   0.008141
                      [39, 40)         8   6.381395   0.410550
                      [40, 44)        13   9.052637   1.721231
                      chi2-statistic of fit 12.398987638400024
$D^*_{109} = 0.0702$. For $\alpha = 0.05$, $k^*_\alpha = 0.895\Big/\left(\sqrt{109} - 0.01 + \frac{0.85}{\sqrt{109}}\right) = 0.0851$. The deviations from the normal distribution are not significant.
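A one-line evaluation of this critical value (a sketch):

n = 109
print(0.895 / (np.sqrt(n) - 0.01 + 0.85 / np.sqrt(n)))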
Solution 3.36 For $X \sim P(100)$, $n_0 = P^{-1}\left(\frac{0.2}{0.3}, 100\right) = 100 + z_{.67}\times 10 = 105$.
random.seed(1)
car = mistat.load_data('CAR')
Fig. 3.5 Q–Q Plot of mean of samples from mpg data; R² = 0.992
# means of repeated samples of size 64 (assumed; S/8 = S/sqrt(64)), 100 replications
means = []
for _ in range(100):
  sample = random.choices(list(car['mpg']), k=64)
  means.append(np.mean(sample))
S = np.std(car['mpg'], ddof=1)
print('standard deviation of mpg', S)
print('S/8', S/8)
S_resample = np.std(means, ddof=1)
print('S.E.{X}', S_resample)
                             random.seed(1)
                             yarnstrg = mistat.load_data('YARNSTRG')
def confidence_interval(x):
  # two-sigma confidence interval for the mean (assumed definition)
  x = np.array(x)
  se = np.std(x, ddof=1) / np.sqrt(len(x))
  return (np.mean(x) - 2 * se, np.mean(x) + 2 * se)

mean = np.mean(yarnstrg)
outside = 0
for _ in range(500):
  sample = random.choices(yarnstrg, k=30)
  ci = confidence_interval(sample)
  if mean < ci[0] or ci[1] < mean:
    outside += 1
hat_alpha = outside / 500
ci = confidence_interval(yarnstrg)
print(f' Mean: {mean}')
print(f' 2-sigma-CI: {ci[0]:.1f} - {ci[1]:.1f}')
print(f' proportion outside: {hat_alpha:.3f}')
 Mean: 2.9238429999999993
 2-sigma-CI: 2.7 - 3.1
 proportion outside: 0.068
random.seed(1)
car = mistat.load_data('CAR')
us_cars = car[car['origin'] == 1]
us_turn = list(us_cars['turn'])
sample_means = []
for _ in range(100):
  x = random.choices(us_turn, k=58)
  sample_means.append(np.mean(x))
# proportion of resampled means not exceeding 37 (assumed bootstrap significance level)
print(np.mean([m <= 37 for m in sample_means]))
0.23
We obtained 𝑃˜ = 0.23. The mean 𝑋¯ = 37.203 is not significantly larger than 37.
stats.binom(50, 0.03).ppf(0.95)
4.0
random.seed(1)
cyclt = mistat.load_data('CYCLT')
# EBD of the mean and of the standard deviation (estimator and n_boot = 1000 assumed)
B = pg.compute_bootci(cyclt, func=np.mean, n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_mean, dist_mean = B
print(f' Mean: {np.mean(dist_mean):.3f}')
print(f' 95%-CI: {ci_mean[0]:.3f} - {ci_mean[1]:.3f}')
B = pg.compute_bootci(cyclt, func=np.std, n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_std, dist_std = B
print(f' Mean: {np.mean(dist_std):.3f}')
print(f' 95%-CI: {ci_std[0]:.3f} - {ci_std[1]:.3f}')
  Mean: 0.652
  95%-CI: 0.550 - 0.760
  Mean: 0.370
  95%-CI: 0.340 - 0.400
cyclt = mistat.load_data('CYCLT')
ebd = {}
for quantile in (1, 2, 3):
  B = pg.compute_bootci(cyclt, func=lambda x: np.quantile(x, 0.25 * quantile),
                        n_boot=1000, confidence=0.95, return_dist=True, seed=1)
Fig. 3.7 Histograms of EBD for quartiles for CYCLT.csv data
  ci, dist = B
  ebd[quantile] = dist
  print(f'Quantile {quantile}: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
t2 = socell['t2']
idx = list(range(len(t1)))

def sample_correlation(x):
    # correlation of t1 and t2 over the resampled row indices (assumed helper)
    x = np.array(x, dtype=int)
    return np.corrcoef(t1.values[x], t2.values[x])[0, 1]

B = pg.compute_bootci(idx, func=sample_correlation,
                      n_boot=1000, confidence=0.95, return_dist=True, seed=1)
ci, dist = B
print(f'rho_XY: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')
Fig. 3.8 Histograms of EBD for correlation for SOCELL.csv data
car = mistat.load_data('CAR')
mpg = car['mpg']
hp = car['hp']
idx = list(range(len(mpg)))
sample_intercept = []
sample_slope = []
for _ in range(1000):
  x = random.choices(idx, k=len(idx))
  result = stats.linregress(hp[x], mpg[x])
  sample_intercept.append(result.intercept)
  sample_slope.append(result.slope)
hm = np.mean(hp)
print(np.std(sample_intercept))
print(np.std(sample_slope))
(iii) The bootstrap S.E. of the intercept and slope are 1.017 and 0.00747, respectively. The standard errors of 𝑎 and 𝑏, according to the formulas of Sect. 4.3.2.1, are 0.8099 and 0.00619, respectively. The bootstrap estimates are quite close to the correct values.
                             Solution 3.50 In Python:
                             cyclt = mistat.load_data('CYCLT')
                             X50 = np.mean(cyclt)
                             SD50 = np.std(cyclt)
                             result = stats.ttest_1samp(cyclt, 0.55)
                             print(f'Xmean_50 = {X50:.3f}')
                             print(result)
                             Xmean_50 = 0.652
                             TtestResult(statistic=1.9425149510299369, pvalue=0.057833259176805,
                             df=49)
                             p*-value: 0.024
                             almpin = mistat.load_data('ALMPIN')
                             diam1 = almpin['diam1']
                             diam2 = almpin['diam2']
Fig. 3.9 Box plots of diam1 and diam2 measurements of the ALMPIN dataset
almpin.boxplot(column=['diam1', 'diam2'])
plt.show()
Solution 3.52 The variances were already compared in the previous exercise. To
compare the means use:
almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']
# Compare means
mean_diam1 = np.mean(diam1)
mean_diam2 = np.mean(diam2)
print(f'Mean diam1: {mean_diam1:.5f}')
print(f'Mean diam2: {mean_diam2:.5f}')
The bootstrap based p-value for the comparison of the means is:
random.seed(1)
# return studentized distance between random samples from diam1 and diam2
def stat_func():
    d1 = random.choices(diam1, k=len(diam1))
    d2 = random.choices(diam2, k=len(diam2))
    return stats.ttest_ind(d1, d2).statistic
p*-value: 0.014
The bootstrap based p-value for the comparison of the variances is:
# Step 1 (assumed sketch; the original code for this step is not shown):
# sample variances of each column and EBD of the variances
columns = ['diam1', 'diam2']
S2 = almpin[columns].var(axis=0, ddof=1)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)
Bt = pd.DataFrame({c: [np.var(random.choices(list(almpin[c]), k=len(almpin)), ddof=1)
                       for _ in range(500)] for c in columns})
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 diam1    0.000266
diam2    0.000320
dtype: float64
F0 1.2016104294478573
FBoot 95% 1.1855457165968324
p*-value: 0.04
mpg = mistat.load_data('MPG')
columns = ['origin1', 'origin2', 'origin3']
# variance for each column
S2 = mpg[columns].var(axis=0, ddof=1)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)
# Step 1 (assumed sketch; the original bootstrap step is not shown): EBD of the variances
Bt = pd.DataFrame({c: [np.var(random.choices(list(mpg[c].dropna()), k=mpg[c].count()), ddof=1)
                       for _ in range(500)] for c in columns})
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 origin1     12.942529
origin2     6.884615
origin3    18.321321
dtype: float64
F0 2.6611975103595213
FBoot 95% 2.6925366761838987
p*-value: 0.058
Fig. 3.10 Distribution of EBD for Exercise 3.54

mpg = mistat.load_data('MPG.csv')
samples = [mpg[key].dropna() for key in ['origin1', 'origin2', 'origin3']]

def test_statistic_F(samples):
    return stats.f_oneway(*samples).statistic

# assumed: deviations of the group means from the grand mean, used below to center
# the bootstrap samples so that the null hypothesis of equal means holds
grand_mean = np.mean(np.concatenate(samples))
DB = [np.mean(sample) - grand_mean for sample in samples]
                       F0 = test_statistic_F(samples)
                       Ns = 1000
                       Fstar = []
                       for _ in range(Ns):
                           Ysamples = []
                           for sample, DBi in zip(samples, DB):
                               Xstar = np.array(random.choices(sample, k=len(sample)))
                               Ysamples.append(Xstar - DBi)
                           Fs = test_statistic_F(Ysamples)
                           Fstar.append(Fs)
                       Fstar = np.array(Fstar)
                       print(f'F = {F0:.3f}')
                       print('ratio', sum(Fstar > F0)/len(Fstar))
                       ax = pd.Series(Fstar).hist(bins=14, color='grey')
                       ax.axvline(F0, color='black', lw=2)
                       ax.set_xlabel('F* values')
                       plt.show()
                       F = 6.076
                       ratio 0.003
𝐹 = 6.076, 𝑃∗ = 0.003 and the hypothesis of equal means is rejected. See Fig. 3.10 for the calculated EBD.
np.random.seed(1)
  The tolerance intervals of the number of defective items in future batches of size
𝑁 = 50, with 𝛼 = 0.05 and 𝛽 = 0.05 are
                                           Limits
                                       𝑝 Lower Upper
                                      0.2   1    23
                                      0.1   0    17
                                     0.05   0     9
Solution 3.56 In Python:
np.random.seed(1)
oturb = mistat.load_data('OTURB.csv')

def getQuantile(x, p):
    # helper (assumed): p-quantile of the bootstrap resample
    return np.quantile(x, p)

B_025 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.025),
                          n_boot=500, seed=1, return_dist=True)
B_975 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.975),
                          n_boot=500, seed=1, return_dist=True)
tol_int = [np.quantile(B_025[1], 0.025), np.quantile(B_975[1], 0.975)]
print(f'Tolerance interval ({tol_int[0]}, {tol_int[1]})')
cyclt = mistat.load_data('CYCLT.csv')
# make use of the fact that a True value is interpreted as 1 and False as 0
print('Values greater 0.7:', sum(cyclt>0.7))
We find that in the sample of $n = 50$ cycle times, there are $X = 20$ values greater than 0.7. If the hypothesis is $H_0: \xi_{.5} \le 0.7$, the probability of observing a value smaller than 0.7 is $p \ge \frac{1}{2}$. Thus, the sign test rejects $H_0$ if $X < B^{-1}\left(\alpha; 50, \frac{1}{2}\right)$. For $\alpha = 0.10$ the critical value is $k_\alpha = 20$. $H_0$ is not rejected.
Solution 3.58 We apply the Wilcoxon signed-rank test from scipy to the differences of oelect from 220.
oelect = mistat.load_data('OELECT.csv')
print(stats.wilcoxon(oelect-220))
WilcoxonResult(statistic=1916.0, pvalue=0.051047599707252124)
car = mistat.load_data('CAR.csv')
fourCylinder = car[car['cyl'] == 4]
uscars = fourCylinder[fourCylinder['origin'] == 1]
foreign = fourCylinder[fourCylinder['origin'] != 1]
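A minimal sketch of the bootstrap comparison that could follow from the data frames above (the pooled resampling scheme and the 1000 replications are assumptions):

# bootstrap null distribution of the studentized difference, resampling both
# groups from the pooled turn diameters (assumed scheme)
pooled = list(uscars['turn']) + list(foreign['turn'])
t0 = stats.ttest_ind(uscars['turn'], foreign['turn']).statistic
dist = []
for _ in range(1000):
    d1 = random.choices(pooled, k=len(uscars))
    d2 = random.choices(pooled, k=len(foreign))
    dist.append(stats.ttest_ind(d1, d2).statistic)
print(f't0 = {t0:.2f}, p* = {np.mean([abs(t) >= abs(t0) for t in dist]):.3f}')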
   The original stat 3.08 is outside of the distribution of the bootstrap samples. The
difference between the means of the turn diameters is therefore significant. Foreign
cars have on the average a smaller turn diameter.
Chapter 4
Variability in Several Dimensions and
Regression Models
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import seaborn as sns
import matplotlib.pyplot as plt
import mistat
Solution 4.1 In Fig. 4.1, one sees that horsepower and miles per gallon are inversely related. Turn diameter seems to increase with horsepower.
car = mistat.load_data('CAR')
sns.pairplot(car[['turn', 'hp', 'mpg']])
plt.show()
Solution 4.2 The box plots in Fig. 4.2 show that cars from Asia generally have the
smallest turn diameter. The maximal turn diameter of cars from Asia is smaller
than the median turn diameter of U.S. cars. European cars tend to have larger turn
diameter than those from Asia, but smaller than those from the U.S.
car = mistat.load_data('CAR')
ax = car.boxplot(column='turn', by='origin')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('origin')
plt.show()
Solution 4.3 (i) The multiple box plots (see Fig. 4.3) show that the conditional
distributions of res3 at different hybrids are different.
[Fig. 4.1: scatterplot matrix of turn, hp and mpg for the CAR dataset]
Fig. 4.2 Boxplots of turn diameter by origin for CAR dataset
Fig. 4.3 Multiple box plots of Res 3 grouped by hybrid number
hadpas = mistat.load_data('HADPAS')
ax = hadpas.boxplot(column='res3', by='hyb')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('Hybrid number')
ax.set_ylabel('Res 3')
plt.show()
(ii) The matrix plot of all the Res variables (see Fig. 4.4) reveals that Res 3 and
Res 7 are positively correlated. Res 20 is generally larger than the corresponding Res
14. Res 18 and Res 20 seem to be negatively associated.
Solution 4.4 The joint frequency distribution of horsepower versus miles per gallon
is
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
  'hp': pd.cut(car['hp'], bins=np.arange(50, 275, 25)),
  'mpg': pd.cut(car['mpg'], bins=np.arange(10, 40, 5)),
})
freqDist = pd.crosstab(binned_car['hp'], binned_car['mpg'])
print(freqDist)
# You can get distributions for hp and mpg by summing along an axis
print(freqDist.sum(axis=0))
print(freqDist.sum(axis=1))
mpg               (10, 15]   (15, 20]   (20, 25]   (25, 30]     (30, 35]
hp
(50, 75]                 0         1           0          4            2
(75, 100]                0         0          23         11            0
Fig. 4.4 Scatterplot matrix of the Res variables for the HADPAS dataset
  (100, 125]                            0              10             11               1                 0
  (125, 150]                            0              14              3               1                 0
  (150, 175]                            0              17              0               0                 0
  (175, 200]                            1               5              0               0                 0
  (200, 225]                            1               3              0               0                 0
  (225, 250]                            0               1              0               0                 0
  mpg
  (10, 15]      2
  (15, 20]     51
  (20, 25]     37
  (25, 30]     17
  (30, 35]      2
  dtype: int64
  hp
  (50, 75]         7
  (75, 100]       34
  (100, 125]      22
  (125, 150]      18
  (150, 175]      17
(175, 200]        6
(200, 225]        4
(225, 250]        1
dtype: int64
   The intervals for HP are from 50 to 250 at fixed length of 25. The intervals for
MPG are from 10 to 35 at length 5. Students may get different results by defining
the intervals differently.
Solution 4.5 The joint frequency distribution of Res 3 and Res 14 is given in the
following table:
hadpas = mistat.load_data('HADPAS')
binned_hadpas = pd.DataFrame({
  'res3': pd.cut(hadpas['res3'], bins=np.arange(1580, 2780, 200)),
  'res14': pd.cut(hadpas['res14'], bins=np.arange(900, 3000, 300)),
})
pd.crosstab(binned_hadpas['res14'], binned_hadpas['res3'])
   The intervals for Res 3 start at 1580 and end at 2580 with length of 200. The
intervals of Res 14 start at 900 and end at 2700 with length of 300.
Solution 4.6 The following is the conditional frequency distribution of Res 3, given
that Res 14 is between 1300 and 1500 ohms:
hadpas = mistat.load_data('HADPAS')
in_range = hadpas[hadpas['res14'].between(1300, 1500)]
pd.cut(in_range['res3'], bins=np.arange(1580, 2780, 200)).value_counts(sort=False)
res3
(1580, 1780]     8
(1780, 1980]    21
(1980, 2180]    25
(2180, 2380]     2
(2380, 2580]     0
Name: count, dtype: int64
Solution 4.7 Following the instructions in the question we obtained the following
results:
                        hadpas = mistat.load_data('HADPAS')
                        bins = [900, 1200, 1500, 1800, 2100, 3000]
                        binned_res14 = pd.cut(hadpas['res14'], bins=bins)
                        results = []
                        for group, df in hadpas.groupby(binned_res14):
                          res3 = df['res3']
                          results.append({
                             'res3': group,
                             'N': len(res3),
                             'mean': res3.mean(),
                             'std': res3.std(),
                          })
                        pd.DataFrame(results)
                        df = pd.DataFrame([
                          [10.0, 8.04, 10.0, 9.14, 10.0, 7.46, 8.0, 6.58],
                          [8.0, 6.95, 8.0, 8.14, 8.0, 6.77, 8.0, 5.76],
                          [13.0, 7.58, 13.0, 8.74, 13.0, 12.74, 8.0, 7.71],
                          [9.0, 8.81, 9.0, 8.77, 9.0, 7.11, 8.0, 8.84],
                          [11.0, 8.33, 11.0, 9.26, 11.0, 7.81, 8.0, 8.47],
                          [14.0, 9.96, 14.0, 8.10, 14.0, 8.84, 8.0, 7.04],
                          [6.0, 7.24, 6.0, 6.13, 6.0, 6.08, 8.0, 5.25],
                          [4.0, 4.26, 4.0, 3.10, 4.0, 5.39, 19.0, 12.50],
                          [12.0, 10.84, 12.0, 9.13, 12.0, 8.15, 8.0, 5.56],
                          [7.0, 4.82, 7.0, 7.26, 7.0, 6.42, 8.0, 7.91],
                          [5.0, 5.68, 5.0, 4.74, 5.0, 5.73, 8.0, 6.89],
                        ], columns=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'])
                        results = []
                        for i in (1, 2, 3, 4):
                          x = df[f'x{i}']
                          y = df[f'y{i}']
                           model = smf.ols(formula=f'y{i} ~ 1 + x{i}', data=df).fit()
                          results.append({
                             'Data Set': i,
                             'Intercept': model.params['Intercept'],
                             'Slope': model.params[f'x{i}'],
                             'R2': model.rsquared,
                          })
                        pd.DataFrame(results)
Notice the influence of the point (19, 12.5) on the regression in Data Set 4. Without this point the correlation between 𝑥 and 𝑦 is zero.
The dataset is known as Anscombe's quartet (see Fig. 4.5). It not only has identical linear regression, but it also has identical means and variances of 𝑥 and 𝑦, and identical correlation between 𝑥 and 𝑦.
Fig. 4.5 Anscombe's quartet
                        car = mistat.load_data('CAR')
                        car[['turn', 'hp', 'mpg']].corr()
                                 turn        hp      mpg
                        turn 1.000000 0.507610 -0.541061
                        hp   0.507610 1.000000 -0.754716
                        mpg -0.541061 -0.754716 1.000000
Solution 4.10 $SSE = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_{i1} - \beta_2X_{i2})^2$.
(i)
$$\begin{aligned}
\frac{\partial}{\partial\beta_0}SSE &= -2\sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_{i1} - \beta_2X_{i2})\\
\frac{\partial}{\partial\beta_1}SSE &= -2\sum_{i=1}^n X_{i1}(Y_i - \beta_0 - \beta_1X_{i1} - \beta_2X_{i2})\\
\frac{\partial}{\partial\beta_2}SSE &= -2\sum_{i=1}^n X_{i2}(Y_i - \beta_0 - \beta_1X_{i1} - \beta_2X_{i2}).
\end{aligned}$$
Equating these partial derivatives to zero and rearranging terms, we arrive at the following set of linear equations:
$$\begin{pmatrix} n & \sum_{i=1}^n X_{i1} & \sum_{i=1}^n X_{i2}\\ \sum_{i=1}^n X_{i1} & \sum_{i=1}^n X_{i1}^2 & \sum_{i=1}^n X_{i1}X_{i2}\\ \sum_{i=1}^n X_{i2} & \sum_{i=1}^n X_{i1}X_{i2} & \sum_{i=1}^n X_{i2}^2 \end{pmatrix}\begin{pmatrix}\beta_0\\ \beta_1\\ \beta_2\end{pmatrix} = \begin{pmatrix}\sum_{i=1}^n Y_i\\ \sum_{i=1}^n X_{i1}Y_i\\ \sum_{i=1}^n X_{i2}Y_i\end{pmatrix}$$
(ii) Let $b_0, b_1, b_2$ be the (unique) solution. From the first equation we get, after dividing by $n$, $b_0 = \bar Y - \bar X_1 b_1 - \bar X_2 b_2$, where $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$, $\bar X_1 = \frac{1}{n}\sum_{i=1}^n X_{i1}$, $\bar X_2 = \frac{1}{n}\sum_{i=1}^n X_{i2}$. Substituting $b_0$ in the second and third equations and rearranging terms, we obtain the reduced system of equations:
$$\begin{pmatrix} Q_1 - n\bar X_1^2 & P_{12} - n\bar X_1\bar X_2\\ P_{12} - n\bar X_1\bar X_2 & Q_2 - n\bar X_2^2 \end{pmatrix}\begin{pmatrix} b_1\\ b_2 \end{pmatrix} = \begin{pmatrix} P_{1y} - n\bar X_1\bar Y\\ P_{2y} - n\bar X_2\bar Y \end{pmatrix}$$
where $Q_1 = \sum X_{i1}^2$, $Q_2 = \sum X_{i2}^2$, $P_{12} = \sum X_{i1}X_{i2}$ and $P_{1y} = \sum X_{i1}Y_i$, $P_{2y} = \sum X_{i2}Y_i$. Dividing both sides by $(n-1)$ we obtain Eq. (4.4.3), and solving we get $b_1$ and $b_2$.
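As an illustration of these normal equations, here is a small sketch with synthetic data (all names are made up for the example); solving the linear system agrees with a direct least-squares fit:

rng = np.random.default_rng(1)
n = 50
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 2 + 1.5 * X1 - 0.5 * X2 + rng.normal(size=n)
# normal equations A b = c
A = np.array([[n, X1.sum(), X2.sum()],
              [X1.sum(), (X1**2).sum(), (X1*X2).sum()],
              [X2.sum(), (X1*X2).sum(), (X2**2).sum()]])
c = np.array([Y.sum(), (X1*Y).sum(), (X2*Y).sum()])
print(np.linalg.solve(A, c))
# same result from lstsq on the design matrix
print(np.linalg.lstsq(np.column_stack([np.ones(n), X1, X2]), Y, rcond=None)[0])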
car = mistat.load_data('CAR')
car = mistat.load_data('CAR')
car = mistat.load_data('CAR')
Model mpg ~ hp + 1:
 Intercept    30.663308
hp           -0.073611
dtype: float64
Model turn ~ hp + 1:
 Intercept    30.281255
hp             0.041971
dtype: float64
Partial correlation -0.27945246615045016
Model e1 ~ e2:
      e2   -0.251008
     dtype: float64
Solution 4.14 The regression of MPG on HP is MPG = 30.6633 − 0.07361 HP. The regression of TurnD on HP is TurnD = 30.2813 + 0.041971 HP. The regression of the residuals $\hat e_1$ on $\hat e_2$ is $\hat e_1 = -0.251\,\hat e_2$. Thus, the partial correlation of MPG and turn diameter, given HP, is $-0.2795$.
     almpin = mistat.load_data('ALMPIN')
model = smf.ols('capDiam ~ 1 + diam2 + diam3', data=almpin).fit()
     model.summary2()
        Notes:
     [1] Standard Errors assume that the covariance matrix of the errors is correctly
     specified.
[2] The condition number is large, 8.69e+03. This might indicate that there are strong multicollinearity or other numerical problems.

The dependence of CapDiam on Diam2, without Diam1, is significant. This is due to the fact that Diam1 and
Diam2 are highly correlated (𝜌 = 0.957). If Diam1 is in the regression, then Diam2
does not furnish additional information on CapDiam. If Diam1 is not included then
Diam2 is very informative.
Solution 4.16 The regression of yield (𝑌𝑖𝑒𝑙𝑑) on the four variables is:
gasol = mistat.load_data('GASOL')
# rename column 'yield' to 'Yield' as 'yield' is a special keyword in Python
gasol = gasol.rename(columns={'yield': 'Yield'})
model = smf.ols(formula='Yield ~ x1 + x2 + astm + endPt',
                data=gasol).fit()
print(model.summary2())
   (ii) 𝑅 2 = 0.962.
   (iii) The regression coefficient of 𝑥 2 is not significant.
   (iv) Running the multiple regression again, without 𝑥2 , we obtain the equation
Fig. 4.6 Q–Q plot of gasol regression residuals (R² = 0.982)
(ii)
$$Q = I - H$$
$$Q^2 = (I - H)(I - H) = I - H - H + H^2 = I - H = Q.$$
$$s_e^2 = \mathbf{y}'QQ\mathbf{y}/(n-k-1) = \mathbf{y}'Q\mathbf{y}/(n-k-1).$$
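A quick numerical illustration that $Q = I - H$ is idempotent (a sketch with an arbitrary design matrix):

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(20) - H
print(np.allclose(Q @ Q, Q))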
Solution 4.19 $1 - R^2_{y(x)} = \dfrac{SSE}{SSD_y}$ where $SSE = \hat{\mathbf{e}}'\hat{\mathbf{e}} = \|\hat{\mathbf{e}}\|^2$.
$$\mathrm{cov}\left(\sum_{i=1}^n \beta_iX_i,\ \sum_{j=1}^n \gamma_jX_j\right) = \sum_{i=1}^n\sum_{j=1}^n \beta_i\gamma_j\,\mathrm{cov}(X_i, X_j) = \sum_{i=1}^n\sum_{j=1}^n \beta_i\gamma_j\Sigma_{ij} = \boldsymbol\beta'\boldsymbol\Sigma\boldsymbol\gamma.$$
Solution 4.23 Rearrange the dataset into the format suitable for the test outlined in the multiple linear regression section.
socell = mistat.load_data('SOCELL')
# combine the two datasets and add the additional columns z and w
socell_1 = socell[['t3', 't1']].copy()
socell_1.columns = ['t3', 't']
socell_1['z'] = 0
socell_2 = socell[['t3', 't2']].copy()
socell_2.columns = ['t3', 't']
socell_2['z'] = 1
combined = pd.concat([socell_1, socell_2])
combined['w'] = combined['z'] * combined['t']
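A minimal sketch of how the combined model could be fitted to obtain the P-values for z and w quoted below (the formula is an assumption):

model = smf.ols('t3 ~ 1 + t + z + w', data=combined).fit()
print(model.summary2())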
   The P-values corresponding to z and w are 0.128 and 0.426, respectively. Accordingly, we can conclude that the slopes and intercepts of the two
simple linear regressions given above are not significantly different. Combining the
data, we have the following regression line for the combined dataset:
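The fitted line itself is not reproduced in the source; a minimal sketch of how it can be obtained, dropping z and w (the name model_combined is ours):

model_combined = smf.ols(formula='t3 ~ t', data=combined).fit()
print(model_combined.params)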
df = mistat.load_data('CEMENT.csv')
(a) Regression of 𝑌 on 𝑋1 is
# fit the simple regression of y on x1 (model1) before computing the ANOVA table
model1 = smf.ols(formula='y ~ x1', data=df).fit()
anova = sms.anova.anova_lm(model1)
print('Analysis of Variance\n', anova)
F = anova.F['x1']
SSE_1 = anova.sum_sq['Residual']
==============================================================================
                  coef    std err          t     P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      81.4793      4.927     16.536     0.000      70.634      92.324
x1              1.8687      0.526      3.550     0.005        0.710      3.027
==============================================================================
R-sq: 0.534
Analysis of Variance
              df       sum_sq      mean_sq          F   PR(>F)
x1          1.0 1450.076328 1450.076328 12.602518 0.004552
Residual 11.0 1265.686749      115.062432       NaN       NaN
==============================================================================
                  coef    std err            t     P>|t|      [0.025       0.975]
------------------------------------------------------------------------------
Intercept      52.5773       2.286      22.998     0.000      47.483       57.671
x1              1.4683       0.121      12.105     0.000       1.198        1.739
x2              0.6623       0.046      14.442     0.000       0.560        0.764
==============================================================================
R-sq: 0.979
Analysis of Variance
              df       sum_sq        mean_sq           F       PR(>F)
x1          1.0 1450.076328 1450.076328 250.425571 2.088092e-08
x2          1.0 1207.782266 1207.782266 208.581823 5.028960e-08
Residual 10.0      57.904483       5.790448        NaN           NaN
Comparing models
    df_resid           ssr df_diff         ss_diff          F        Pr(>F)
0      11.0 1265.686749         0.0            NaN       NaN           NaN
1      10.0     57.904483       1.0 1207.782266 208.581823 5.028960e-08
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     48.1936      3.913     12.315      0.000      39.341      57.046
==============================================================================
                  coef     std err            t      P>|t|       [0.025  0.975]
------------------------------------------------------------------------------
Intercept      62.4054      70.071       0.891       0.399     -99.179  223.989
x1              1.5511       0.745       2.083       0.071       -0.166   3.269
x2              0.5102       0.724       0.705       0.501       -1.159   2.179
x3              0.1019       0.755       0.135       0.896       -1.638   1.842
x4             -0.1441       0.709      -0.203       0.844       -1.779   1.491
==============================================================================
R-sq: 0.982
Analysis of Variance
             df        sum_sq       mean_sq             F        PR(>F)
x1        1.0 1450.076328 1450.076328 242.367918 2.887559e-07
x2        1.0 1207.782266 1207.782266 201.870528 5.863323e-07
x3        1.0      9.793869       9.793869     1.636962 2.366003e-01
x4        1.0      0.246975       0.246975     0.041280 8.440715e-01
Residual 8.0      47.863639       5.982955          NaN            NaN
Comparing models
    df_resid         ssr df_diff      ss_diff         F    Pr(>F)
0       9.0 48.110614         0.0        NaN       NaN       NaN
1       8.0 47.863639         1.0 0.246975 0.04128 0.844071
outcome = 'y'
all_vars = ['x1', 'x2', 'x3', 'x4']
Final model
y ~ 1 + x4 + x1 + x2
Intercept    71.648307
x4           -0.236540
x1             1.451938
x2             0.416110
dtype: float64
car = mistat.load_data('CAR')
car_3 = car[car['origin'] == 3]
print('Full dataset shape', car.shape)
print('Origin 3 dataset shape', car_3.shape)
model = smf.ols(formula='mpg ~ hp + 1', data=car_3).fit()
print(model.summary2())
influence = model.get_influence()
df = pd.DataFrame({
  'hp': car_3['hp'],
  'mpg': car_3['mpg'],
  'resi': model.resid,
  'sres': influence.resid_studentized_internal,
  'hi': influence.hat_matrix_diag,
  'D': influence.cooks_distance[0],
})
print(df.round(4))
   Notice that points 51 and 94 have residuals with large magnitude (-10.4 and 7.4).
Points 51 and 94 also have the largest Cook's distances (0.58 and 0.21). Points 100 and
102 have high hi values (leverage: 0.18 and 0.22).
np.random.seed(1)
settings = {'s': 0.005, 'v0': 0.002, 'k': 1000, 'p0': 90_000,
[Figure: piston cycle times (seconds) plotted against m, with group means and pooled-standard-deviation error bars]
                                  group_std = results.groupby('m').std()
                                  pooled_std = np.sqrt(np.sum(group_std**2) / len(group_std))[0]
                                  print('Pooled standard deviation', pooled_std)
                                  group_mean = results.groupby('m').mean()
                                  ax = results.plot.scatter(x='m', y='seconds', color='black')
                                  ax.errorbar(group_mean.index, results.groupby('m').mean().values.flatten(),
                                              yerr=[pooled_std] * 4, color='grey')
                                  plt.show()
                                     We see that the differences between the sample means are not significant in spite
                                  of the apparent upward trend in cycle times.
Solution 4.28 Prepare dataset and visualize distributions (see Fig. 4.8).
Fig. 4.8 Box plot of pre-etch line width from integrated circuits fabrication process
df = pd.DataFrame([
  [2.58, 2.62, 2.22],
  [2.48, 2.77, 1.73],
  [2.52, 2.69, 2.00],
  [2.50, 2.80, 1.86],
  [2.53, 2.87, 2.04],
  [2.46, 2.67, 2.15],
  [2.52, 2.71, 2.18],
  [2.49, 2.77, 1.86],
  [2.58, 2.87, 1.84],
  [2.51, 2.97, 1.86]
], columns=['Exp. 1', 'Exp. 2', 'Exp. 3'])
df.boxplot()
# reshape to long format with an 'Experiment' label column and the measured value 'mu'
df = df.melt(var_name='Experiment', value_name='mu')
experiment = df['Experiment']
mu = df['mu']
def onewayTest(x, verbose=False):
    df = pd.DataFrame({
        'value': x,
        'variable': experiment,
    })
[Fig. 4.9: box plots of film speed for Batch A and Batch B]
                          Bt0 = onewayTest(mu)
                          print('Bt0', Bt0)
                          print('ratio', sum(B[1] >= Bt0)/len(B[1]))
                          Bt0 120.91709844559576
                          ratio 0.0
The bootstrap test also shows that the difference in means is significant.
Solution 4.29 Create dataset and visualize distribution (see Fig. 4.9).
                          df = pd.DataFrame({
                            'Batch A': [103, 107, 104, 102, 95, 91, 107, 99, 105, 105],
                            'Batch B': [104, 103, 106, 103, 107, 108, 104, 105, 105, 97],
                          })
                          fig, ax = plt.subplots()
                          bplot1 = ax.boxplot(df, labels=df.columns)
                          plt.show()
    The randomization test gave a P value of 0.106. The difference between the means
is not significant.
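   The randomization test code is not included in the source; a minimal sketch, assuming the absolute difference of the batch means as the test statistic:

np.random.seed(1)
pooled = np.concatenate([df['Batch A'], df['Batch B']])
observed = abs(df['Batch A'].mean() - df['Batch B'].mean())
n_iter = 10_000
count = 0
for _ in range(n_iter):
    np.random.shuffle(pooled)          # random reassignment to the two batches
    diff = abs(pooled[:10].mean() - pooled[10:].mean())
    if diff >= observed:
        count += 1
print('P value', count / n_iter)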
    (ii)
   The ANOVA also shows no significant difference in the means. The 𝑃 value
is 0.228. Remember that the 𝐹-test in the ANOVA is based on the assumption of
normality and equal variances. The randomization test is nonparametric.
Solution 4.30 Define function that calculates the statistic and execute bootstrap.
def func_stats(x):
    m = pd.Series(x).groupby(df['Experiment']).agg(['mean', 'count'])
    top = np.sum(m['count'] * m['mean'] ** 2) - len(x)*np.mean(x)**2
    return top / np.std(x) ** 2
Bt = []
mu = list(df['mu'])
for _ in range(1000):
    mu_star = random.sample(mu, len(mu))
    Bt.append(func_stats(mu_star))
Bt0 = func_stats(mu)
print('Bt0', Bt0)
print('ratio', sum(bt >= Bt0 for bt in Bt)/len(Bt))
Bt0 26.986990459670288
ratio 0.0
The result demonstrates that the differences between the experiments are significant.
Solution 4.31 Load the data and visualize the distributions (see Fig. 4.10).
place = mistat.load_data('PLACE')
place.boxplot('xDev', by='crcBrd')
plt.show()
Fig. 4.10 Box plot visualisation of xDev distribution by crcBrd for the PLACE dataset
G1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
G2 = [10, 11, 12]
G3 = [13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26]
G4 = [20]
place['group'] = 'G1'
place.loc[place['crcBrd'].isin(G2), 'group'] = 'G2'
place.loc[place['crcBrd'].isin(G3), 'group'] = 'G3'
place.loc[place['crcBrd'].isin(G4), 'group'] = 'G4'
# group summary statistics (mean and standard error) of xDev by group; the
# computation of 'statistics' is not shown in the source and is reconstructed here
statistics = place.groupby('group')['xDev'].agg(['mean', 'sem'])
sem = statistics['sem'].values
sem = sem**2
sem = np.sqrt(sem[:-1] + sem[1:])
print(sem * 6.193)
print(757/644, 467/387, 486/459)
The differences between the means of the groups are all significant.
Solution 4.32
df = pd.DataFrame({
  'US': [33, 25],
  'Europe': [7, 7],
  'Asia': [26, 11],
  })
print(df)
col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)
total = df.to_numpy().sum()
   US Europe Asia
0 33        7  26
1 25        7  11
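The computation of the test statistic is not shown; a minimal sketch of the omitted step, using the expected counts under independence (degrees of freedom (2 - 1)(3 - 1) = 2):

expected = np.outer(row_sums, col_sums) / total
chi2_stat = ((df.to_numpy() - expected)**2 / expected).sum()
dof = (df.shape[0] - 1) * (df.shape[1] - 1)
print(f'chi2: {chi2_stat:.3f}')
print(f'p-value: {stats.chi2.sf(chi2_stat, dof):.3f}')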
chi2: 2.440
p-value: 0.295
   The chi-square test statistic is 𝑋 2 = 2.440 with d.f. = 2 and 𝑃 value = 0.295. The
null hypothesis that the number of cylinders a car has is independent of the origin of
the car is not rejected.
   We can also use the scipy function chi2_contingency.
chi2 = stats.chi2_contingency(df)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
chi2-statistic: 2.440
p-value: 0.295
d.f.: 2
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
  'turn': pd.cut(car['turn'], bins=[27, 30.6, 34.2, 37.8, 45]), #np.arange(27, 50, 3.6)),
  'mpg': pd.cut(car['mpg'], bins=[12, 18, 24, 100]),
})
freqDist = pd.crosstab(binned_car['mpg'], binned_car['turn'])
print(freqDist)
chi2 = stats.chi2_contingency(freqDist)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
The dependence between turn diameter and miles per gallon is significant.
question_13 = pd.DataFrame({
  '1': [0,0,0,1,0],
  '2': [1,0,2,0,0],
  '3': [1,2,6,5,1],
  '4': [2,1,10,23,13],
  '5': [0,1,1,15,100],
  }, index = ['1', '2', '3', '4', '5']).transpose()
question_23 = pd.DataFrame({
  '1': [1,0,0,3,1],
  '2': [2,0,1,0,0],
  '3': [0,4,2,3,0],
  '4': [1,1,10,7,5],
  '5': [0,0,1,30,134],
  }, index = ['1', '2', '3', '4', '5']).transpose()
chi2_13 = stats.chi2_contingency(question_13)
chi2_23 = stats.chi2_contingency(question_23)
print('Question 1 vs 3')
print(f' Mean squared contingency : {msc_13:.3f}')
print(f' Tschuprov : {tschuprov_13:.3f}')
print(f" Cramer's index : {cramer_13:.3f}")
print('Question 2 vs 3')
print(f' Mean squared contingency : {msc_23:.3f}')
print(f' Tschuprov : {tschuprov_23:.3f}')
print(f" Cramer's index : {cramer_23:.3f}")
Question 1 vs 3
  Mean squared contingency : 0.629
  Tschuprov : 0.397
  Cramer's index : 0.561
Question 2 vs 3
  Mean squared contingency : 1.137
  Tschuprov : 0.533
  Cramer's index : 0.754
Chapter 5
Sampling for Estimation of Finite Population
Quantities
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import matplotlib.pyplot as plt
import mistat
\[
\Pr\{\sqrt{n}\,|\bar{X}_n - \mu_N| < \delta\} \approx 2\Phi\Bigl(\frac{\delta}{\sigma_N}\Bigr) - 1,
\]
as \(n \to \infty\).
Fig. 5.1 Distribution of correlation between xDev and yDev for sampling with (RSWR) and without (RSWOR) replacement
random.seed(1)
place = mistat.load_data('PLACE')

def stat_func(sample_idx):
    # sample correlation between xDev and yDev for the sampled rows
    # (the definition of stat_func is not shown in the source and is reconstructed here)
    sample = place.iloc[sample_idx]
    return np.corrcoef(sample['xDev'], sample['yDev'])[0, 1]

rswr = []
rswor = []
idx = list(range(len(place)))
for _ in range(100):
    rswr.append(stat_func(random.choices(idx, k=20)))
    rswor.append(stat_func(random.sample(idx, k=20)))
Fig. 5.2 Distribution of median of turn-diameter, horsepower, and mpg of the CAR dataset using random sampling without replacement
Solution 5.5 The Python code creates the histograms shown in Fig. 5.2.
random.seed(1)
car = mistat.load_data('CAR')
columns = ['turn', 'hp', 'mpg']

def stat_func(sample_idx):
    # medians of turn, hp, and mpg for the sampled rows
    # (the definition of stat_func is not shown in the source and is reconstructed here)
    return car[columns].iloc[sample_idx].median()

idx = list(range(len(car)))
result = []
for _ in range(200):
    result.append(stat_func(random.sample(idx, k=50)))
result = pd.DataFrame(result)
The total sample size is 𝑛 = 200 and the weights are 𝑊1 = 0.385, 𝑊2 = 0.115, 𝑊3 = 0.5. Thus,
𝑛1 = 77, 𝑛2 = 23, and 𝑛3 = 100.
# strata_1, strata_2, strata_3 are the lists of xDev values in the three strata
# (their construction is not shown in the source); weights and allocations from above
w_1, w_2, w_3 = 0.385, 0.115, 0.5
n_1, n_2, n_3 = 77, 23, 100
sample_means = []
for _ in range(500):
    m_1 = np.mean(random.sample(strata_1, k=n_1))
    m_2 = np.mean(random.sample(strata_2, k=n_2))
    m_3 = np.mean(random.sample(strata_3, k=n_3))
    sample_means.append(w_1*m_1 + w_2*m_2 + w_3*m_3)
std_dev_sample_means = np.std(sample_means)
print(std_dev_sample_means)
print(stats.sem(place['xDev'], ddof=0))
 3.442839155174113e-05
 8.377967188860638e-05
\[
\frac{W_i^2\,\tilde{\sigma}_{N_i}^2}{n_i^2} = \lambda, \quad i = 1, \ldots, k, \qquad
\sum_{i=1}^{k} n_i = n.
\]
Equivalently, \(n_i = \frac{1}{\sqrt{\lambda}}\, W_i \tilde{\sigma}_{N_i}\), for \(i = 1, \ldots, k\), and \(n = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^{k} W_i \tilde{\sigma}_{N_i}\). Thus
\[
n_i^0 = n\, \frac{W_i \tilde{\sigma}_{N_i}}{\sum_{j=1}^{k} W_j \tilde{\sigma}_{N_j}}, \quad i = 1, \ldots, k,
\]
where
\[
I_i = \begin{cases} 1, & \text{if the } i\text{-th population element is sampled,}\\ 0, & \text{otherwise.} \end{cases}
\]
   Notice that 𝐼1 , . . . , 𝐼 𝑁 are independent of 𝑒 1 , . . . , 𝑒 𝑁 . Hence, 𝐸 {𝐼𝑖 𝑒 𝑖 } = 0 for all
𝑖 = 1, . . . , 𝑁. This proves that 𝑌¯𝑛 is prediction unbiased, irrespective of the sample
strategy. The prediction MSE of 𝑌¯𝑛 is
\[
PMSE\{\bar{Y}_n \mid \mathbf{s}\}
= \frac{\sigma^2}{n}\Bigl(1 - \frac{n}{N}\Bigr)^2 + \frac{1}{N}\Bigl(1 - \frac{n}{N}\Bigr)\sigma^2
= \frac{\sigma^2}{n}\Bigl(1 - \frac{n}{N}\Bigr).
\]
Notice that \(PMSE\{\bar{Y}_n \mid \mathbf{s}\}\) is independent of \(\mathbf{s}\), and is equal for all samples.
\[
I_i = \begin{cases} 1, & \text{if the } i\text{-th population element is in the sample } \mathbf{s},\\ 0, & \text{otherwise.} \end{cases}
\]
   Recall that \(y_i = \beta x_i + e_i\), \(i = 1, \ldots, N\), and that for any sampling strategy \(e_1, \ldots, e_N\) are independent of \(I_1, \ldots, I_N\). Hence, since \(\sum_{i=1}^{N} I_i = n\),
\[
E\{\hat{Y}_{RA}\} = \bar{x}_N \cdot \frac{1}{n}\sum_{i=1}^{N} E\Bigl\{I_i\,\frac{\beta x_i + e_i}{x_i}\Bigr\}
= \bar{x}_N\Bigl(\beta + \frac{1}{n}\sum_{i=1}^{N} E\Bigl\{I_i\,\frac{e_i}{x_i}\Bigr\}\Bigr)
= \bar{x}_N\,\beta,
\]
because \(E\{I_i\, e_i/x_i\} = E\{I_i/x_i\}\, E\{e_i\} = 0\), \(i = 1, \ldots, N\). Thus \(E\{\hat{Y}_{RA} - \mu_N\} = 0\) and \(\hat{Y}_{RA}\) is an unbiased predictor.
\[
PMSE\{\hat{Y}_{RA}\} = E\Biggl\{\Biggl(\frac{\bar{x}_N}{n}\sum_{i=1}^{N} I_i\,\frac{e_i}{x_i} - \frac{1}{N}\sum_{i=1}^{N} e_i\Biggr)^{2}\Biggr\}
= V\Biggl\{\sum_{i \in \mathbf{s}_n}\Bigl(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Bigr) e_i - \sum_{i' \in \mathbf{r}_n}\frac{e_{i'}}{N}\Biggr\},
\]
where \(\mathbf{s}_n\) is the set of elements in the sample and \(\mathbf{r}_n\) is the set of elements in \(\mathcal{P}\) but not in \(\mathbf{s}_n\), \(\mathbf{r}_n = \mathcal{P} - \mathbf{s}_n\). Since \(e_1, \ldots, e_N\) are uncorrelated,
\[
PMSE\{\hat{Y}_{RA} \mid \mathbf{s}_n\}
= \sigma^2 \sum_{i \in \mathbf{s}_n}\Bigl(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Bigr)^2 + \sigma^2\,\frac{N - n}{N^2}
= \frac{\sigma^2}{N^2}\Biggl[(N - n) + \sum_{i \in \mathbf{s}_n}\Bigl(\frac{N \bar{x}_N}{n x_i} - 1\Bigr)^2\Biggr].
\]
A sample \(\mathbf{s}_n\) which minimizes \(\sum_{i \in \mathbf{s}_n}\bigl(\frac{N \bar{x}_N}{n x_i} - 1\bigr)^2\) is optimal.
Chapter 6
Time Series Analysis and Prediction

import math
import mistat
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa import tsatools
import statsmodels.formula.api as smf
Solution 6.2 (i) Fig. 6.1 shows the change of demand over time.
   (ii) We first fit the seasonal trend to the data in the SeasCom data set. We use
the linear model 𝑌 = 𝑋𝛽 + 𝜀, where the 𝑌 vector contains the 102 observations in the data
set and 𝑋 is a 102 × 4 matrix of four column vectors. We combine 𝑋 and 𝑌 in a data
frame. In addition to the SeasCom data, we add a column of 1's (const), a column
with the numbers 1 to 102 (trend), and two columns that describe the seasonal pattern:
the column season_1 consists of cos(𝜋 × trend/6), and the column season_2 is
sin(𝜋 × trend/6).
seascom = mistat.load_data('SEASCOM.csv')
df = tsatools.add_trend(seascom, trend='ct')
df['season_1'] = [np.cos(math.pi * tx/6) for tx in df['trend']]
df['season_2'] = [np.sin(math.pi * tx/6) for tx in df['trend']]
print(df.head())
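The fitting step is not shown in the source; a minimal sketch, assuming the data column is named SeasCom (as it is used below):

model = smf.ols(formula='SeasCom ~ trend + season_1 + season_2', data=df).fit()
print(model.summary2())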
[Fig. 6.1: the SeasCom demand data (Data) plotted against Time]
                                seascom = mistat.load_data('SEASCOM.csv')
                                fig, ax = plt.subplots()
                                ax.scatter(seascom.index, seascom, facecolors='none', edgecolors='grey')
                                model.predict(df).plot(ax=ax, color='black')
                                ax.set_xlabel('Time')
                                ax.set_ylabel('Data')
                                plt.show()
(iii) Calculate the residuals and plot them (see Fig. 6.2).
                                U = df['SeasCom'] - model.predict(df)
                                fig, ax = plt.subplots()
                                ax.scatter(U.index, U, facecolors='none', edgecolors='black')
ax.set_xlabel('Time')
ax.set_ylabel('Deviation')
plt.show()
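The code computing the lag correlations is not shown; a minimal sketch using the pandas autocorrelation of the residual series U:

print(f'Corr(Ut,Ut-1) = {U.autocorr(lag=1):.3f}')
print(f'Corr(Ut,Ut-2) = {U.autocorr(lag=2):.3f}')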
  Corr(Ut,Ut-1) = -0.191
  Corr(Ut,Ut-2) = 0.132
   Indeed the correlations between adjacent data points are 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡 −1 ) =
−0.191, and 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡 −2 ) = 0.132.
   (iv) A plot of the deviations, together with the low autocorrelations, suggests that
the residuals behave essentially like random noise.
Solution 6.3 According to Equation 6.2.2, the auto-correlation in a stationary MA(q) is
\[
\rho(h) = \frac{K(h)}{K(0)} = \frac{\sum_{j=0}^{q-h} \beta_j \beta_{j+h}}{\sum_{j=0}^{q} \beta_j^2}, \qquad 0 \le h \le q.
\]
   Notice that \(\rho(h) = \rho(-h)\), and \(\rho(h) = 0\) for all \(h > q\).
Solution 6.4 In Python
                                   0       1      2        3      4      5
                        K(h) 3.308 1.672 0.542 0.541 1.028 0.550
                        rho(h) 1.000 0.506 0.164 0.163 0.311 0.166
Solution 6.5 We consider the MA(∞) process given by coefficients \(\beta_j = q^j\), with \(0 < q < 1\).
In this case,
   (i) \(E\{X_t\} = 0\), and
   (ii) \(V\{X_t\} = \sigma^2 \sum_{j=0}^{\infty} q^{2j} = \sigma^2/(1 - q^2)\).
Since the mean and the variance are finite and do not depend on \(t\), the process is
stationary.
   (ii) 𝐸 {𝑋𝑡 } = 0
   (iii) According to the Yule-Walker equations,
\[
\begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix}
\begin{pmatrix} K(0) \\ K(1) \end{pmatrix}
= \begin{pmatrix} \sigma^2 \\ 0 \end{pmatrix}.
\]
\[
X_t = (A_2(z))^{-1} \varepsilon_t
= \varepsilon_t + 0.5\,\varepsilon_{t-1} - 0.08\,\varepsilon_{t-2} - 0.205\,\varepsilon_{t-3} - 0.0761\,\varepsilon_{t-4} + 0.0296\,\varepsilon_{t-5} + \ldots
\]
Accordingly,
Solution 6.11 Let 𝑋 denote the DOW1941 data set. We create a new series 𝑌 of second
order differences, i.e., 𝑌𝑡 = 𝑋𝑡 − 2𝑋𝑡−1 + 𝑋𝑡−2 .
dow1941 = mistat.load_data('DOW1941.csv')
   (i) In the following table we present the autocorrelations (acf) and the partial
autocorrelations (pacf) of 𝑌. For a visualisation see Fig. 6.3.
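The code producing the table is not shown; a minimal sketch, assuming DOW1941 loads as a single series:

x = dow1941.squeeze()
y2 = x.diff().diff().dropna()   # second order differences Y_t = X_t - 2*X_{t-1} + X_{t-2}
table = pd.DataFrame({'acf': acf(y2, nlags=10), 'pacf': pacf(y2, nlags=10)})
print(table.round(3))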
Fig. 6.3 Autocorrelations (acf) and partial autocorrelations (pacf) of the second order differences of the DOW1941 dataset
Solution 6.12 In Fig. 6.4, we present the seasonal data SeasCom and the one-day
ahead predictions. We see an excellent prediction.
# seascom_model and seascom_df are assumed to be the seasonal regression model and
# data frame fitted in Solution 6.2 (their construction is not repeated here)
predictedError = mistat.optimalLinearPredictor(seascom_model.resid,
                      10, nlags=9)
predictedTrend = seascom_model.predict(seascom_df)
                                    correctedTrend = predictedTrend + predictedError
                                    fig, ax = plt.subplots()
                                    ax.scatter(seascom_df.index, seascom_df['SeasCom'],
                                               facecolors='none', edgecolors='grey')
                                    predictedTrend.plot(ax=ax, color='grey')
                                    correctedTrend.plot(ax=ax, color='black')
                                    ax.set_xlabel('Time')
                                    ax.set_ylabel('SeasCom data')
                                    plt.show()
Fig. 6.4 One-day ahead prediction model of SeasCom data set
Chapter 7
Modern analytic methods: Part I
import warnings
import mistat
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
Solution 7.2 The decision tree model for testResult results in the confusion ma-
trix shown in Fig. 7.1.
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'testResult'
X = sensors[predictors]
The confusion matrix (rows: true label, columns: predicted label, in the order Brake,
Good, Grippers, IMP, ITM, Motor, SOS, Velocity Type I, Velocity Type II) is:

Brake                4    0    0    0    0    1    1    0    0
Good                 0   82    0    0    0    0    0    0    0
Grippers             0    0   14    0    0    0    0    0    0
IMP                  0    1    1    3    0    0    0    0    0
ITM                  0    2    0    0   31    0    0    0    0
Motor                0    0    0    0    0   15    1    0    0
SOS                  0    0    0    0    0    0    2    0    0
Velocity Type I      0    0    0    0    0    0    0   11    0
Velocity Type II     0    1    0    0    0    0    0    0    4

Fig. 7.1 Decision tree model to predict testResult from sensor data (Exc. 7.2)
y = sensors[outcome]
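The model fitting and confusion-matrix plotting code is not shown in the source; a minimal sketch, assuming the same ccp_alpha pruning value used for the decision trees later in this chapter:

model = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
model.fit(X, y)
ConfusionMatrixDisplay.from_predictions(y, model.predict(X),
                                        xticks_rotation='vertical')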
plt.show()
   The model's classification performance is very good. The test result 'Good', which
corresponds to status 'Pass', is correctly predicted. Most of the other individual test
results also have low misclassification rates. The likely reason for this is that each
test result is associated with a specific subset of the sensors.
                                       Solution 7.3 In Python
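The model fit that produces this confusion matrix is not shown; a minimal sketch, assuming an XGBoost classifier trained on status (XGBoost requires numeric class labels, hence the LabelEncoder):

X = sensors[predictors]
y = LabelEncoder().fit_transform(sensors['status'])
xgb = XGBClassifier()
xgb.fit(X, y)
print('Confusion matrix')
print(confusion_matrix(y, xgb.predict(X)))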
                                        Confusion matrix
                                        [[92 0]
                                         [ 0 82]]
var_imp = pd.DataFrame({
  'importance': xgb.feature_importances_,
  }, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
var_imp['order'] = range(1, len(var_imp) + 1)
print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]
            importance    order
sensor18      0.290473        1
sensor54      0.288680        2
sensor53      0.105831        3
sensor55      0.062423        4
sensor61      0.058112        5
sensor48      0.040433        6
sensor07      0.026944        7
sensor12      0.015288        8
sensor03      0.013340        9
sensor52      0.013160       10
            importance    order
sensor18      0.290473        1
sensor07      0.026944        7
sensor21      0.000000       50
   The decision tree model uses sensors 18, 7, and 21. The xgboost model identifies
sensor 18 as the most important variable. Sensor 7 is ranked 7th, sensor 21 has no
importance.
y = sensors['status']
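The random forest fit that produces this confusion matrix is not shown; a minimal sketch (the random_state value is our assumption):

model = RandomForestClassifier(random_state=0)
model.fit(X, y)
print('Confusion matrix')
print(confusion_matrix(y, model.predict(X)))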
Confusion matrix
[[92 0]
 [ 0 82]]
var_imp = pd.DataFrame({
  'importance': model.feature_importances_,
  }, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
var_imp['order'] = range(1, len(var_imp) + 1)
print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]
           importance     order
sensor61     0.138663         1
sensor18     0.100477         2
sensor53     0.079890         3
sensor52     0.076854         4
sensor46     0.052957         5
sensor50     0.051970         6
sensor44     0.042771         7
sensor48     0.037087         8
sensor24     0.036825         9
sensor21     0.035014        10
           importance     order
sensor18     0.100477         2
sensor21     0.035014        10
sensor07     0.023162        17
   The decision tree model uses sensors 18, 7, and 21. Sensor 18 has the second
largest importance value, sensor 21 ranks 10th, and sensor 7 is on rank 17.
   The accuracy for predicting the validation set is very good for all three models,
with random forest giving the best model with an accuracy of 0.986. The xgboost
model has a slightly lower accuracy of 0.957 and the accuracy for the decision tree
model is 0.900.
   If you change the random seeds or remove them for the various commands, you
will see that the accuracies vary and that the order can change.
dt_model = DecisionTreeClassifier(ccp_alpha=0.012)
random_valid_acc = []
random_train_acc = []
org_valid_acc = []
org_train_acc = []
for _ in range(100):
  train_X, valid_X, train_y, valid_y = train_test_split(X, y,
    test_size=0.4)
  dt_model.fit(train_X, train_y)
  org_train_acc.append(accuracy_score(train_y, dt_model.predict(train_X)))
  org_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))
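The loop that fills random_train_acc and random_valid_acc is not shown in the source; a minimal sketch, assuming the labels are randomly permuted before each split:

for _ in range(100):
  y_random = pd.Series(np.random.permutation(y), index=y.index)
  train_X, valid_X, train_y, valid_y = train_test_split(X, y_random,
    test_size=0.4)
  dt_model.fit(train_X, train_y)
  random_train_acc.append(accuracy_score(train_y, dt_model.predict(train_X)))
  random_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))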
ax = pd.Series(random_valid_acc).plot.density(color='C0')
pd.Series(random_train_acc).plot.density(color='C0', linestyle='--',
                                      ax=ax)
pd.Series(org_valid_acc).plot.density(color='C1', ax=ax)
pd.Series(org_train_acc).plot.hist(color='C1', linestyle='--',
                                   ax=ax)
ax.set_ylim(0, 40)
ax.set_xlim(0, 1.01)
plt.show()
                                   Solution 7.7 Load data
                                   distTower = mistat.load_data('DISTILLATION-TOWER.csv')
                                   predictors = ['Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6',
                                                 'Temp7', 'Temp8', 'Temp9', 'Temp10', 'Temp11', 'Temp12']
                                   outcome = 'VapourPressure'
                                   Xr = distTower[predictors]
                                   yr = distTower[outcome]
   (ii) Determine the model performance for different tree complexities, together with
the dependence of the tree depth on the ccp_alpha parameter; see Fig. 7.2.
   The smallest validation set error is obtained for ccp_alpha = 0.372. The depen-
dence of the training and validation error on ccp_alpha is shown in Fig. 7.2.
   (iii) The final model is visualized using dtreeviz in Fig. 7.3.
Fig. 7.2 Decision tree regressor complexity as a function of ccp_alpha. The validation set error is shown in black, the training set error in grey (Exercise 7.7)
Solution 7.8 Vary the number of bins and the binning strategy. The influence of the two
model parameters on model performance is shown in Fig. 7.4.
y = sensors['status']
                        results = []
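The parameter sweep itself is not included in the source; a minimal sketch, assuming resubstitution accuracy as the performance measure shown in Fig. 7.4:

predictors = [c for c in sensors.columns if c.startswith('sensor')]
X = sensors[predictors]
for strategy in ['uniform', 'quantile', 'kmeans']:
  for n_bins in range(2, 11):
    kbins = KBinsDiscretizer(encode='ordinal', strategy=strategy, n_bins=n_bins)
    X_binned = kbins.fit_transform(X)
    nb_model = MultinomialNB().fit(X_binned, y)
    results.append({'strategy': strategy, 'n_bins': n_bins,
                    'accuracy': accuracy_score(y, nb_model.predict(X_binned))})
results = pd.DataFrame(results)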
Fig. 7.3 Decision tree visualization of regression tree (Exercise 7.7)
   The quantile binning strategy (for each feature, each bin has the same number
of data points) with splitting each column into two bins leads to the best performing
model. With this strategy, performance declines with increasing number of bins. The
uniform (for each feature, each bin has the same width) and the kmeans (for each
feature, a 𝑘-means clustering is used to bin the feature) strategies on the other hand,
show increasing performance with increasing number of bins.
   The confusion matrix for the best performing models is:
kbinsDiscretizer = KBinsDiscretizer(encode='ordinal',
  strategy='quantile', n_bins=2)
X_binned = kbinsDiscretizer.fit_transform(X)
nb_model = MultinomialNB()
nb_model.fit(X_binned, y)
print('Confusion matrix')
print(confusion_matrix(y, nb_model.predict(X_binned)))
 Confusion matrix
 [[87 5]
  [ 1 81]]
Fig. 7.4 Influence of number of bins and binning strategy on model performance for the sensors data with status as outcome
   The decision tree model misclassified three of the 'Pass' data points as 'Fail'.
The Naïve Bayes model on the other hand misclassifies six data points. However,
five of these are 'Pass' and predicted as 'Fail'. Depending on your use case, you may
prefer a model with more false negatives or false positives.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
X = scaler.fit_transform(food)
model = AgglomerativeClustering(n_clusters=10, compute_distances=True)
model = model.fit(X)
fig, ax = plt.subplots()
# plot_dendrogram is assumed to be the dendrogram helper from the scikit-learn
# documentation example for AgglomerativeClustering
plot_dendrogram(model, ax=ax)
ax.set_title('Dendrogram')
ax.get_xaxis().set_ticks([])
plt.show()
                               food = mistat.load_data('FOOD.csv')
                               scaler = StandardScaler()
                               X = scaler.fit_transform(food)
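The code producing Fig. 7.6 is not shown; a minimal sketch, assuming a 2 x 2 grid of dendrograms, one per linkage method:

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, linkage in zip(axes.flatten(), ['ward', 'complete', 'average', 'single']):
    model = AgglomerativeClustering(n_clusters=10, linkage=linkage,
                                    compute_distances=True).fit(X)
    plot_dendrogram(model, ax=ax)
    ax.set_title(linkage)
    ax.get_xaxis().set_ticks([])
plt.show()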
Fig. 7.5 Hierarchical clustering of food data set using Ward clustering
   The comparison of the different linkage methods is shown in Fig. 7.6. We can
see that Ward's clustering gives the most balanced clusters; three bigger clusters and
seven small clusters. Complete, average, and single linkage lead to one big cluster.
                             Solution 7.11 We determine 10 clusters using K-means clustering.
                             sensors = mistat.load_data('SENSORS.csv')
                             predictors = [c for c in sensors.columns if c.startswith('sensor')]
                             outcome = 'status'
                             X = sensors[predictors]
                             scaler = StandardScaler()
                             X = scaler.fit_transform(X)
                             model = KMeans(n_clusters=10, random_state=1).fit(X)
                             df = pd.DataFrame({
                               'status': sensors['status'],
  'testResult': sensors['testResult'],
  'cluster': model.predict(X),
})

Fig. 7.6 Comparison of different linkage methods for hierarchical clustering of food data set
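The code producing the cluster counts below is not shown; a minimal sketch, grouping by status and counting cluster membership:

for status, group in df.groupby('status'):
  print('Status', status)
  print(group['cluster'].value_counts())
  print()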
                              Status Fail
                              cluster
                              2    27
                              8    14
                              3    13
                              4    10
                              7     9
                              0     9
                              6     5
                              9     3
                              5     1
                              1     1
                              Name: count, dtype: int64
                              Status Pass
                              cluster
                              7    45
                              0    37
                              Name: count, dtype: int64
                                There are several clusters that contain only ‘Fail’ data points. They correspond to
specific sensor value combinations that are very distinct from the sensor values during
normal operation. The 'Pass' data points are found in two clusters. Both of these
clusters also contain 'Fail' data points.
                                Analyse cluster membership by testResult.
for cluster, group in df.groupby('cluster'):
    print(f'Cluster {cluster}')
    print(group['testResult'].value_counts())
    print()
Cluster 1
testResult
SOS    1
Name: count, dtype: int64
Cluster 2
testResult
ITM    27
Name: count, dtype: int64
Cluster 3
testResult
Grippers    11
ITM          2
Name: count, dtype: int64
Cluster 4
testResult
Velocity Type I    10
Name: count, dtype: int64
Cluster 5
testResult
SOS    1
Name: count, dtype: int64
Cluster 6
testResult
Velocity Type II    5
Name: count, dtype: int64
Cluster 7
testResult
Good        45
IMP          3
Brake        3
Grippers     1
Motor        1
ITM          1
Name: count, dtype: int64
Cluster 8
testResult
Motor     11
Brake      3
Name: count, dtype: int64
Cluster 9
testResult
ITM         2
Grippers    1
Name: count, dtype: int64
We can see that some of the test results are only found in one or two clusters.
Solution 7.12 The scikit-learn K-means clustering method can return either the
cluster centers or the distances of a data point to all the cluster centers. We evaluate
both as features for classification.
# Data preparation
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = sensors[outcome]
   First, classifying data points based on the cluster membership. In order to use that
information in a classifier, we transform the cluster membership information using one-hot
encoding.
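The code for this first variant is not shown in the source; a minimal sketch, assuming a 60/40 train/validation split as in the earlier exercises:

results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
  model = KMeans(n_clusters=n_clusters, random_state=1).fit(X)
  # one-hot encode the cluster membership of each data point
  members = model.predict(X).reshape(-1, 1)
  Xcluster = OneHotEncoder().fit_transform(members)
  train_X, valid_X, train_y, valid_y = train_test_split(Xcluster, y,
    test_size=0.4, random_state=1)
  clf.fit(train_X, train_y)
  results.append({'n_clusters': n_clusters,
                  'accuracy': accuracy_score(valid_y, clf.predict(valid_X))})
print(pd.DataFrame(results).round(3))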
       n_clusters   accuracy
0               2      0.529
1               3      0.805
2               4      0.805
3               5      0.805
4               6      0.799
5               7      0.828
6               8      0.874
7               9      0.897
8              10      0.897
9              11      0.879
10             12      0.885
11             13      0.902
12             14      0.902
13             15      0.902
14             16      0.885
15             17      0.879
16             18      0.879
17             19      0.885
Fig. 7.7 Dependence of accuracy on number of clusters using cluster membership as feature (Exc. 7.12)
   The accuracies are visualized in Fig. 7.7. We see that splitting the dataset into 7
clusters gives a classification model with an accuracy of about 0.9.
   Next we use the distance to the cluster centers as variables in the classifier.
results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
  # fit a model and convert data to distances
  model = KMeans(n_clusters=n_clusters, random_state=1)
  model.fit(X)
  Xcluster = model.transform(X)
  # the rest of the loop is missing in the source; assumed: 60/40 split and validation accuracy
  train_X, valid_X, train_y, valid_y = train_test_split(Xcluster, y, test_size=0.4)
  clf.fit(train_X, train_y)
  results.append({'n_clusters': n_clusters,
                  'accuracy': accuracy_score(valid_y, clf.predict(valid_X))})
      n_clusters    accuracy
0              2       0.983
1              3       0.977
2              4       0.977
3              5       0.977
4              6       0.977
5              7       0.977
6              8       0.977
7              9       0.983
8             10       0.983
9             11       0.977
10            12       0.977
11            13       0.983
12            14       0.977
13               15         0.983
14               16         0.977
15               17         0.971
16               18         0.977
17               19         0.977

Fig. 7.8 Dependence of accuracy on number of clusters using distance to cluster center as feature (Exc. 7.12)
   The accuracies of all models are very high. The largest accuracy is achieved for 6
clusters.
   Based on these results, we would design the procedure using the decision tree
classifier combined with K-means clustering into six clusters. Using scikit-learn,
we can define the full procedure as a single pipeline as follows:
pipeline = make_pipeline(
  StandardScaler(),
  KMeans(n_clusters=6, random_state=1),
  DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
)
X = sensors[predictors]
y = sensors[outcome]
process = pipeline.fit(X, y)
print('accuracy', accuracy_score(y, process.predict(X)))
print('Confusion matrix')
print(confusion_matrix(y, process.predict(X)))
accuracy 0.9770114942528736
Confusion matrix
[[91 1]
 [ 3 79]]
Chapter 8
Modern analytic methods: Part II

import mistat
import networkx as nx
from pgmpy.estimators import HillClimbSearch
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
dissolution = mistat.load_data('DISSOLUTION.csv')
   Use shift registration to align the dissolution data with spline interpolation of
order 1, 2, and 3.
from skfda.representation.interpolation import SplineInterpolation

# fd is an FDataGrid built from the dissolution curves and shift_registration is a
# skfda shift-registration transformer; their construction is not shown in the source
fd_registered = {}
for order in (1, 2, 3):
    fd.interpolation = SplineInterpolation(interpolation_order=order)
    fd_registered[order] = shift_registration.fit_transform(fd)
Fig. 8.1 Mean dissolution curves for reference and test tablets derived using spline interpolation of order 1, 2, and 3
                                          For each of the three registered datasets, calculate the mean dissolution curves
                                       for reference and test tablets and plot the results.
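   A sketch of how the group means shown in Fig. 8.1 can be computed and plotted,
reusing the hypothetical labels from the setup sketch above:
fig, axes = plt.subplots(ncols=3, figsize=(9, 3), sharey=True)
for ax, (order, fd_reg) in zip(axes, fd_registered.items()):
    for group, color in (('R', 'C0'), ('T', 'C1')):
        idx = np.where(labels == group)[0]
        # mean curve of the group, drawn into the panel for this spline order
        fd_reg[idx].mean().plot(axes=[ax], color=color)
    ax.set_title(f'Order {order}')
    ax.set_xlabel('Time')
axes[0].set_ylabel('Dissolution')
plt.show()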
   The dissolution curves are shown in Fig. 8.1. We can see in all three graphs that
the test tablets show a slightly faster dissolution than the reference tablets. Comparing
the shape of the curves, the curves from linear spline interpolation level off with time.
With quadratic spline interpolation, the dissolution curves go through a maximum,
which is unrealistic. The cubic spline interpolation also leads to unrealistic curves
that first level off and then start to increase again.
import skfda
from skfda import FDataGrid

# pinchraw (force measurements, one column per replication) and pinchtime
# (the sampling times) are assumed to have been loaded earlier
fd = FDataGrid(pinchraw.transpose(), pinchtime)
fig = fd.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
   Fig. 8.2 shows the measured pinch forces. We can see that the start of the force
varies from 0.025 to 0.1 seconds. This makes it difficult to compare the shape of the
curves. The shape of the individual curves is not symmetric, with a faster onset of
the force and a slower decline.
   (iii) We create a variety of smoothed versions of the dataset to explore the effect
of varying the smoothing parameter.
import itertools
from skfda.preprocessing.smoothing.kernel_smoothers import NadarayaWatsonSmoother
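   The smoothed curves in Fig. 8.3 can be produced by looping over a grid of
smoothing parameters. The exact grid used for the figure is not fully recoverable,
so the parameter values in the following sketch are illustrative:
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, param in zip(axes.flatten(), (0.05, 0.01, 0.001, 0.0001)):
    smoother = NadarayaWatsonSmoother(smoothing_parameter=param)
    smoother.fit_transform(fd).plot(axes=[ax])
    ax.set_title(f'Smoothing parameter {param}')
plt.tight_layout()
plt.show()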
   Fig. 8.3 shows smoothed measurement curves for a variety of smoothing parameter
values. If the values are too large, the data are oversmoothed and the asymmetric shape
of the curves is lost. With decreasing values, the shape is reproduced better, but the
curves become noisier again. We select 0.005 as the smoothing parameter.
smoother = NadarayaWatsonSmoother(smoothing_parameter=0.005)
fd_smooth = smoother.fit_transform(fd)
[Fig. 8.3: smoothed pinch force curves for several smoothing parameters, including 0.001 and 0.0001]
                                max_idx = fd_smooth.data_matrix.argmax(axis=1)
                                landmarks = [pinchtime[idx] for idx in max_idx]
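   The registered curves fd_landmark used below can be derived from these landmarks
with skfda's landmark shift registration; a minimal sketch (the function is called
landmark_shift_registration in newer skfda releases):
from skfda.preprocessing.registration import landmark_shift

# shift each curve so that its maximum (the landmark) occurs at the same time
fd_landmark = landmark_shift(fd_smooth, landmarks)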
   (iv) Fig. 8.4 shows the measurements after smoothing and landmark shift registration.
                                fig = fd_landmark.plot()
                                ax = fig.axes[0]
                                ax.set_xlabel('Time [s]')
                                ax.set_ylabel('Pinch force')
                                plt.show()
import skfda
frequencies = moisturespectrum['Moisturespectrum']['x']
spectra = moisturespectrum['Moisturespectrum']['y']
moisture = moisturevalues['Moisturevalues']
(ii) We can use a histogram to look at the distribution of the moisture values.
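For example (a minimal sketch; the number of bins is arbitrary):
plt.hist(np.ravel(moisture), bins=20)
plt.xlabel('Moisture content')
plt.ylabel('Frequency')
plt.show()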
   Fig. 8.5 shows a bimodal distribution of the moisture content with a clear separation
of the two peaks. Based on this, we select 14.5 as the threshold to separate high and
low moisture content.
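   The class labels used for the classification models below follow directly from this
threshold; a minimal sketch:
# label each sample as 'high' or 'low' moisture content using the 14.5 threshold
moisture_class = np.where(np.ravel(moisture) > 14.5, 'high', 'low')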
    (iii - iv) The spectrum information only needs to be transposed to be in the array
format required for the FDataGrid class. As we can see in the left graph of Fig. 8.6,
the spectra are not well aligned but show a large spread in the intensities. This is
likely due to the difficulty of having a clearly defined concentration across samples.
In order to reduce this variation, we can transform the intensities by dividing the
intensities by the mean of each sample.
intensities = spectra.transpose()
fd = skfda.FDataGrid(intensities, frequencies)
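   The normalized spectra shown in the right panel of Fig. 8.6 and used below as
fd_normalized can be obtained, for instance, by dividing each spectrum by its mean
intensity:
# remove the overall level differences between samples
normalized = intensities / intensities.mean(axis=1, keepdims=True)
fd_normalized = skfda.FDataGrid(normalized, frequencies)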
Fig. 8.6 Near-infrared spectra of the moisture dataset. Left: raw spectra, Right: normalized spectra
   As we can see in the right graph of Fig. 8.6, the normalized spectra are now better
aligned. We also see that the overall shape of the spectra is fairly consistent between
samples.
   (v) We repeat the model building for both the original and normalized spectra 50
times. At each iteration, we split the data set into training and test sets (50-50), build
the model with the training set, and measure accuracy using the test set. By using the
same random seed for splitting the original and the normalized dataset, we can better
compare the models. The accuracies from the 50 iterations are compared in Fig. 8.7.
accuracies = []
for rs in range(50):
    X_train, X_test, y_train, y_test = train_test_split(fd,
        moisture_class, random_state=rs, test_size=0.5)
    knn_original = KNeighborsClassifier().fit(X_train, y_train)
    acc_original = accuracy_score(y_test, knn_original.predict(X_test))
    X_train, X_test, y_train, y_test = train_test_split(fd_normalized,
        moisture_class, random_state=rs, test_size=0.5)
    knn_normalized = KNeighborsClassifier().fit(X_train, y_train)
    acc_normalized = accuracy_score(y_test, knn_normalized.predict(X_test))
    accuracies.append({'original': acc_original, 'normalized': acc_normalized})
accuracies = pd.DataFrame(accuracies)
Fig. 8.7 Accuracies of classification models based on original and normalized spectra. The line
indicates equal performance.
# mean of accuracies
mean_accuracy = accuracies.mean()
mean_accuracy
original       0.7976
normalized     0.9468
dtype: float64
   Fig. 8.7 clearly shows that classification models based on the normalized spectra
achieve better accuracies. The mean accuracy increases from 0.8 to 0.95.
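   A scatter plot like Fig. 8.7 can, for example, be created from the paired accuracies
as follows (Fig. 8.8 further below is obtained in the same way from the mae data frame):
ax = accuracies.plot.scatter(x='original', y='normalized')
minmax = [accuracies.min().min(), accuracies.max().max()]
ax.plot(minmax, minmax, color='grey')   # line of equal performance
ax.set_xlabel('Accuracy of models based on original spectra')
ax.set_ylabel('Accuracy of models based on normalized spectra')
plt.show()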
Fig. 8.8 Mean absolute error of regression models using original and normalized spectra. The line
indicates equal performance.
mae = []
for rs in range(50):
    X_train, X_test, y_train, y_test = train_test_split(fd,
        moisture, random_state=rs, test_size=0.5)
    knn_original = KNeighborsRegressor()
    knn_original.fit(X_train, y_train)
    mae_original = mean_absolute_error(y_test, knn_original.predict(X_test))
    X_train, X_test, y_train, y_test = train_test_split(fd_normalized,
        moisture, random_state=rs, test_size=0.5)
    knn_normalized = KNeighborsRegressor()
    knn_normalized.fit(X_train, y_train)
    mae_normalized = mean_absolute_error(y_test, knn_normalized.predict(X_test))
    mae.append({'original': mae_original, 'normalized': mae_normalized})
mae = pd.DataFrame(mae)
                       # mean of MAE
                       mean_mae = mae.mean()
                       mean_mae
                       original       0.817016
                       normalized     0.433026
                       dtype: float64
   Fig. 8.8 clearly shows that regression models based on the normalized spectra
achieve better performance. The mean absolute error decreases from 0.82 to 0.43.
                          (iii) We use the last regression model from (ii) to create a plot of actual versus
                       predicted moisture content for the test data.
y_pred = knn_normalized.predict(X_test)
predictions = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
minmax = [min(*y_test, *y_pred), max(*y_test, *y_pred)]
ax = predictions.plot.scatter(x='actual', y='predicted')
ax.set_xlabel('Moisture content')
ax.set_ylabel('Predicted moisture content')
ax.plot(minmax, minmax, color='grey')
plt.show()
   Fig. 8.9 shows two clusters of points. One cluster contains the samples with high
moisture content and the other cluster the samples with low moisture content. The
clusters are well separated and the predictions fall in the typical range for each
cluster. However, within a cluster, predictions and actual values are not highly
correlated. In other words, while the regression model can distinguish between
samples with high and low moisture content, the moisture content is otherwise
not well predicted. There is therefore no advantage in using the regression model
compared to the classification model.
Solution 8.5 (i) See solution for Exercise 8.3.
(ii) In Python:
from skfda.preprocessing.dim_reduction import FPCA
# older skfda releases: from skfda.preprocessing.dim_reduction.projection import FPCA

fpca_original = FPCA(n_components=2)
_ = fpca_original.fit(fd)
fpca_normalized = FPCA(n_components=2)
_ = fpca_normalized.fit(fd_normalized)
Fig. 8.10 Projection of spectra onto first two principal components. Left: original spectra, right:
normalized spectra
# scores of the normalized spectra on the first two principal components
fpca_df = pd.DataFrame(fpca_normalized.transform(fd_normalized))
fig, ax = plt.subplots()
fpca_df.plot.scatter(x=0, y=1,
    c=['C1' if mc == 'high' else 'C2' for mc in moisture_class], ax=ax)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
plt.show()
   Fig. 8.10 compares the PCA projections for the original and normalized data.
We can see that the second principal component clearly separates the two moisture
content classes for the normalized spectra. This is not the case for the original spectra.
Solution 8.6 (i) We demonstrate a solution to this exercise using two blog posts
preprocessed and included in the mistat package. The content of the posts was
converted to plain text with each paragraph on a line. The two blog posts can be loaded as
follows:
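   The exact loading code is not reproduced here; conceptually, blogs maps a label to
the raw text of the corresponding post. A hypothetical sketch reading the texts from
local files:
# hypothetical file names; the preprocessed texts ship with the mistat package
blogs = {}
for label, filename in (('blog-1', 'blog1.txt'), ('blog-2', 'blog2.txt')):
    with open(filename) as f:
        blogs[label] = f.read()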
                          The variable blogs is a dictionary with labels as keys and text as values. We next
                       split the data into a list of labels and a list of non-empty paragraphs.
                       paragraphs = []
                       labels = []
                       for blog, text in blogs.items():
                         for paragraph in text.split('\n'):
                           paragraph = paragraph.strip()
                           if not paragraph: # ignore empty paragraphs
                             continue
                           paragraphs.append(paragraph)
                           labels.append(blog)
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    text = '\n'.join(line for line in text.split('\n')
                     if not line.startswith('ntsb'))
    return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
                             stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
termCounts = np.array(counts.sum(axis=0)).flatten()
topCounts = termCounts.argsort()
terms = vectorizer.get_feature_names_out()
for n in reversed(topCounts[-10:]):
  print(f'{terms[n]:14s} {termCounts[n]:3d}')
global              63
climate             59
warming             57
change              55
ice                 35
sea                 34
earth               33
ocean               29
temperatures        28
heat                25
(iv)
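The topic table below requires the TF-IDF weighting of the term counts and a
truncated SVD with five components (parts (ii) and (iii)); a sketch of these steps,
mirroring the ten-component version shown in (v):
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfTransformer().fit_transform(counts)
svd = TruncatedSVD(5)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)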
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
  compSort = component.argsort()
  idx = list(reversed(compSort[-10:]))
  data[f'Topic {i}'] = [terms[n] for n in idx]
  data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
print("{\\tiny")
print(df.style.format(precision=2).hide(axis='index').to_latex(hrules=True))
print("}")
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
      change             0.24   ice               0.39   sea                 0.25   extreme              0.53   snow                  0.44
      climate            0.24   sea               0.34   earth               0.23   events               0.31   cover                 0.27
      global             0.23   sheets            0.27   energy              0.21   heat                 0.23   sea                   0.14
      sea                0.22   shrinking         0.19   light               0.21   precipitation        0.20   decreased             0.13
      warming            0.21   level             0.17   gases               0.19   earth                0.13   temperatures          0.13
      ice                0.20   arctic            0.15   ice                 0.18   light                0.13   hurricanes            0.11
      temperature        0.17   ocean             0.12   infrared            0.17   energy               0.12   level                 0.10
      ocean              0.16   declining         0.10   greenhouse          0.16   gases                0.11   increase              0.10
      earth              0.16   levels            0.08   level               0.15   infrared             0.11   climate               0.10
      extreme            0.15   glaciers          0.08   arctic              0.12   greenhouse           0.10   rise                  0.09
   We can identify topics related to sea warming, ice sheets melting, greenhouse
effect, extreme weather events, and hurricanes.
   (v) Repeat the analysis requesting 10 components in the SVD.
svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
  compSort = component.argsort()
  idx = list(reversed(compSort[-10:]))
  data[f'Topic {i}'] = [terms[n] for n in idx]
  data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
      change             0.24   ice               0.39   sea                 0.25   extreme              0.54   snow                  0.39
      climate            0.24   sea               0.35   earth               0.24   events               0.31   cover                 0.23
      global             0.23   sheets            0.27   energy              0.21   heat                 0.24   sea                   0.18
      sea                0.22   shrinking         0.19   light               0.21   precipitation        0.20   hurricanes            0.13
      warming            0.21   level             0.17   gases               0.19   light                0.13   level                 0.13
      ice                0.20   arctic            0.15   ice                 0.18   earth                0.12   climate               0.13
      temperature        0.18   ocean             0.13   infrared            0.17   energy               0.12   temperatures          0.12
      ocean              0.16   declining         0.10   greenhouse          0.16   gases                0.12   decreased             0.11
      earth              0.16   levels            0.08   level               0.15   greenhouse           0.11   increase              0.10
      extreme            0.15   glaciers          0.07   atmosphere          0.12   infrared             0.11   temperature           0.10
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
      sea                0.37   snow                 0.34   ocean                0.40   glaciers        0.37    responsibility         0.33
      level              0.35   ocean                0.29   acidification        0.20   retreat         0.26    authorities            0.26
      rise               0.17   cover                0.25   hurricanes           0.18   glacial         0.23    pollution              0.19
      extreme            0.14   acidification        0.18   glaciers             0.16   water           0.22    personal               0.14
      global             0.12   extreme              0.17   water                0.15   months          0.15    arctic                 0.11
      events             0.12   decreased            0.14   waters               0.12   summer          0.14    change                 0.11
      temperature        0.12   carbon               0.11   temperatures         0.11   plants          0.12    wildfires              0.10
      hurricanes         0.10   point                0.11   marine               0.09   stream          0.11    carbon                 0.10
      impacts            0.10   pollution            0.11   coral                0.09   animals         0.11    percent                0.10
      coastal            0.07   waters               0.11   reefs                0.09   going           0.11    stop                   0.10
   The first five topics are identical to the result in (iv). This is an expected property
of the SVD: the leading singular vectors do not change when additional components are
requested.
Fig. 8.11 Projection of the documents onto first two singular values. Red: blog post 1, green: blog
post 2
   (vi) Fig. 8.11 shows the individual documents projected onto the first two singular
values of the LSA. Based on this visualization, we can say that the two documents
discuss different aspects of global warming and that blog post 1 covers the subject in
more detail.
fig, ax = plt.subplots()
blog1 = [label == 'blog-1' for label in labels]
blog2 = [label == 'blog-2' for label in labels]
ax.plot(lsa_tfidf[blog1, 0], lsa_tfidf[blog1, 1], 'ro')
ax.plot(lsa_tfidf[blog2, 0], lsa_tfidf[blog2, 1], 'go')
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Solution 8.7 We follow the same procedure as in Exercise 8.6 using a set of three
articles preprocessed and included in the mistat package.
               paragraphs = []
               labels = []
               for blog, text in blogs.items():
                 for paragraph in text.split('\n'):
                   paragraph = paragraph.strip()
                   if not paragraph:
                     continue
                   paragraphs.append(paragraph)
                   labels.append(blog)
               def preprocessor(text):
                      text = text.lower()
                      text = re.sub(r'\d[\d,]*', '', text)
                      text = '\n'.join(line for line in text.split('\n')
                                       if not line.startswith('ntsb'))
                      return text
TF-IDF transformation
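The document-term matrix and TF-IDF weights are built as in Exercise 8.6; a sketch
of these omitted steps:
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(preprocessor=preprocessor,
                             stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
tfidf = TfidfTransformer().fit_transform(counts)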
                     svd = TruncatedSVD(10)
                     tfidf = Normalizer().fit_transform(tfidf)
                     lsa_tfidf = svd.fit_transform(tfidf)
Topics:
                     terms = vectorizer.get_feature_names_out()
                     data = {}
                     for i, component in enumerate(svd.components_, 1):
                       compSort = component.argsort()
                       idx = list(reversed(compSort[-10:]))
                       data[f'Topic {i}'] = [terms[n] for n in idx]
                       data[f'Loading {i}'] = [component[n] for n in idx]
                     df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
                           labour          0.29   labour             0.28   economic         0.24   capacity             0.21   covid                  0.22
                           covid           0.22   south              0.27   percent          0.23   financial            0.20   america                0.21
                           impact          0.19   north              0.22   covid            0.19   firms                0.18   reset                  0.21
                           market          0.19   differences        0.20   gdp              0.17   international        0.17   latin                  0.20
                           south           0.18   americas           0.17   impact           0.15   household            0.17   economic               0.18
                           america         0.16   channel            0.16   imf              0.13   access               0.14   needs                  0.18
                           pandemic        0.15   agenda             0.14   social           0.12   depends              0.14   social                 0.16
                           channel         0.15   covid              0.13   pre              0.12   largely              0.14   asymmetric             0.12
                           economic        0.14   post               0.11   growth           0.10   state                0.13   consequences           0.11
                           north           0.14   welfare            0.09   world            0.10   support              0.13   region                 0.11
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
                           economic            0.29   covid                 0.21   crisis             0.19   self                 0.17   north                0.14
                           channel             0.20   occupations           0.20   poverty            0.16   employed             0.16   differences          0.14
                           social              0.18   asymmetric            0.20   labor              0.15   unbearable           0.15   need                 0.13
                           market              0.14   economic              0.20   deepen             0.15   lightness            0.15   south                0.13
                           recovery            0.12   consequences          0.19   inequality         0.12   informal             0.13   americas             0.12
                           labour              0.11   transition            0.15   differences        0.11   international        0.13   work                 0.12
                           impact              0.10   social                0.13   north              0.10   poverty              0.13   governments          0.11
                           preliminary         0.09   differences           0.13   deleterious        0.10   economic             0.11   precarious           0.10
                           preservation        0.09   occupation            0.13   large              0.10   financial            0.11   world                0.09
                           consequences        0.09   americas              0.13   frictions          0.10   inequality           0.11   majority             0.09
fig, ax = plt.subplots()
for blog in blogs:
    match = [label == blog for label in labels]
    ax.plot(lsa_tfidf[match, 0], lsa_tfidf[match, 1], 'o', label=blog)
ax.legend()
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Fig. 8.12 Projection of the documents onto first two singular values for the three articles (blog-1,
blog-2, blog-3)
data = mistat.load_data('LAPTOP_REVIEWS')
data['Review'] = data['Review title'] + ' ' + data['Review content']
reviews = data.dropna(subset=['User rating', 'Review title', 'Review content'])
(ii) Convert the text representation into a document term matrix (DTM).
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
                             stop_words='english')
counts = vectorizer.fit_transform(reviews['Review'])
print('shape of DTM', counts.shape)
print('total number of terms', np.sum(counts))
(iii) Convert the counts in the document term matrix (DTM) using TF-IDF.
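A sketch of this step, assuming scikit-learn's TfidfTransformer:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer().fit_transform(counts)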
    (iv) Using scikit-learn’s TruncatedSVD method, we convert the sparse tfidf ma-
trix to a denser representation.
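A sketch of this step; the output below suggests that 20 components were used:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

svd = TruncatedSVD(20)
lsa_tfidf = svd.fit_transform(Normalizer().fit_transform(tfidf))
print(lsa_tfidf.shape)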
(7433, 20)
   (v) We use logistic regression to classify reviews with a user rating of five as
positive and all others as negative.
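A sketch of such a model; the train/test split proportions and random seed are
assumptions, so the exact accuracy will differ from the value reported below:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# positive class: user rating of five
outcome = (reviews['User rating'] == 5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    lsa_tfidf, outcome, test_size=0.4, random_state=1)
model = LogisticRegression()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))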
0.769334229993275