R Programming(20MCA2PERR)
UNIT - 3
Descriptive Statistics
• Descriptive statistics (in the broad sense of the term) is a branch of statistics aiming
at summarizing, describing and presenting a series of values or a dataset.
• Descriptive statistics is often the first step and an important part of any statistical
analysis. It allows us to check the quality of the data and helps us to “understand” the
data by giving a clear overview of it.
• If well presented, descriptive statistics is already a good starting point for further
analyses.
• Measure of central tendency: represents the whole set of data by a single value. It gives the location of the central point of the data.
• Measure of variability: also known as the spread of the data, it describes how widely the data are distributed.
Need of Descriptive Analysis
• Descriptive Analysis helps us to understand our data and is a very
important part of Machine Learning.
• This is due to Machine Learning being all about making predictions.
• On the other hand, statistics is all about drawing conclusions from data,
which is a necessary initial step for Machine Learning.
Description                              R function
Mean                                     mean()
Standard deviation                       sd()
Variance                                 var()
Minimum                                  min()
Maximum                                  max()
Median                                   median()
Range of values (minimum and maximum)    range()
Sample quantiles                         quantile()
Generic function                         summary()
Interquartile range                      IQR()
Data
• The dataset iris is used for demonstration purposes. It ships with R in the datasets package and
is available by simply referring to iris.
dat <- iris # copy the iris dataset into an object named dat
(data(iris) also loads the dataset, but under its own name, iris)
• To preview this dataset and its structure:
head(dat) # first 6 observations, try tail() for the last 6
• library(help="datasets") # lists all datasets in the datasets package
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
str(dat) # structure of dataset
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The dataset contains 150 observations and 5 variables, representing the length and
width of the sepal and petal and the species of 150 flowers.
Length and width of the sepal and petal are numeric variables and the species is a factor
with 3 levels (setosa, versicolor and virginica).
Minimum and maximum
min(dat$Sepal.Length)
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9
range()
rng <- range(dat$Sepal.Length)
rng
## [1] 4.3 7.9
Note: The output of the range() function is a numeric vector of length 2 containing the minimum and
maximum, so each value can be extracted by position:
rng[1] # first element of rng: the minimum
## [1] 4.3
rng[2] # second element of rng: the maximum
## [1] 7.9
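If the range is wanted as a single number (maximum minus minimum), diff() can be applied to the vector returned by range(). A minimal sketch, assuming dat is a copy of iris as above; the name spread is only illustrative:

```r
dat <- iris                      # same object as in the slides
rng <- range(dat$Sepal.Length)   # numeric vector: c(minimum, maximum)
spread <- diff(rng)              # maximum - minimum; 'spread' is an illustrative name
spread
```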
Mean
mean(dat$Sepal.Length)
## [1] 5.843333
Note:
• If there is at least one missing value in the dataset, use
mean(dat$Sepal.Length, na.rm = TRUE) to compute the mean with
the NA excluded. This argument can be used for most functions
presented here, not only the mean.
• For a truncated (trimmed) mean, use mean(dat$Sepal.Length, trim = 0.10) and
change the trim argument to suit your needs.
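The effect of na.rm and trim is easier to see on a small vector. The sketch below uses made-up values (not taken from iris) with one missing value and one extreme value:

```r
# Illustrative vector: one missing value and one extreme value
x <- c(2, 4, 6, 8, NA, 100)

mean(x)                             # NA, because one value is missing
mean(x, na.rm = TRUE)               # 24: the NA is dropped before averaging
mean(x, na.rm = TRUE, trim = 0.2)   # 6: the lowest and highest 20% are dropped first
```

With trim = 0.2 and five non-missing values, one value is removed from each end, so the extreme value 100 no longer influences the result.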
Median
median(dat$Sepal.Length)
## [1] 5.8
or with the quantile() function:
quantile(dat$Sepal.Length, 0.5)
#since the quantile of order 0.5 (q0.5) corresponds to the median.
## 50%
## 5.8
Quartile
First and third quartile
quantile(dat$Sepal.Length, 0.25) # first quartile
## 25%
## 5.1
quantile(dat$Sepal.Length, 0.75) # third quartile
## 75%
## 6.4
Other quantiles
quantile(dat$Sepal.Length, 0.4) # 4th decile
## 40%
## 5.6
quantile(dat$Sepal.Length, 0.98) # 98th percentile
## 98%
## 7.7
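The quantile() calls above can also be combined, because the probs argument accepts a vector. A short sketch, assuming dat is a copy of iris; the name q is only illustrative:

```r
dat <- iris
# probs accepts a vector, so several quantiles come back in one named vector
q <- quantile(dat$Sepal.Length, probs = c(0.25, 0.4, 0.5, 0.75, 0.98))
q
names(q)   # "25%" "40%" "50%" "75%" "98%"
```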
Interquartile range
IQR(dat$Sepal.Length) #the difference between the first and third quartile
## [1] 1.3
(OR)
quantile(dat$Sepal.Length, 0.75) - quantile(dat$Sepal.Length, 0.25)
## 75%
## 1.3
Standard deviation and variance
sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935
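Both sd() and var() compute sample statistics, that is, they divide by n - 1 rather than n, and the standard deviation is the square root of the variance. The following sketch checks both facts by hand; manual_var is an illustrative name:

```r
dat <- iris
x <- dat$Sepal.Length
n <- length(x)

# Sample variance written out: squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))    # TRUE
all.equal(sqrt(var(x)), sd(x))   # TRUE
```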
head(dat) # first 6 observations
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
tail(dat) # last 6 observations
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
head(dat, 10) # first 10 observations
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
dat<-iris
head(dat)
head(dat,30)
print("Structure of IRIS data set",quote=FALSE)
str(dat) # str() prints the structure itself; wrapping it in print() adds a stray NULL
print(paste("Minimum of sepal length",min(dat$Sepal.Length)))
print(paste("Maximum of sepal length",max(dat$Sepal.Length)))
# range() returns two values, so collapse them into a single string
print(paste("Range of sepal length",paste(range(dat$Sepal.Length),collapse=" to ")))
print(paste("Mean value of sepal length",mean(dat$Sepal.Length)))
print(paste("Mean value of sepal length(trimmed)",mean(dat$Sepal.Length,trim=0.10)))
print(paste("Mean value of sepal length(rounded)",round(mean(dat$Sepal.Length),digits=2)))
print(paste("Median value of sepal length",median(dat$Sepal.Length)))
print(paste("Quantile value of sepal length",quantile(dat$Sepal.Length,0.5)))
print(paste("First quartile value of sepal length",quantile(dat$Sepal.Length,0.25)))
print(paste("Third quartile value of sepal length",quantile(dat$Sepal.Length,0.75)))
print(paste("Inter quartile range of sepal length",IQR(dat$Sepal.Length)))
print(paste("Fourth decile value of sepal length",quantile(dat$Sepal.Length,0.4)))
print(paste("98th percentile value of sepal length",quantile(dat$Sepal.Length,0.98)))
print(paste("Standard Deviation of sepal length",sd(dat$Sepal.Length)))
print(paste("Variance value of sepal length",var(dat$Sepal.Length)))
Output (head() output omitted):
[1] Structure of IRIS data set
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
[1] "Minimum of sepal length 4.3"
[1] "Maximum of sepal length 7.9"
[1] "Range of sepal length 4.3 to 7.9"
[1] "Mean value of sepal length 5.84333333333333"
[1] "Mean value of sepal length(trimmed) 5.80833333333333"
[1] "Mean value of sepal length(rounded) 5.84"
[1] "Median value of sepal length 5.8"
[1] "Quantile value of sepal length 5.8"
[1] "First quartile value of sepal length 5.1"
[1] "Third quartile value of sepal length 6.4"
[1] "Inter quartile range of sepal length 1.3"
[1] "Fourth decile value of sepal length 5.6"
[1] "98th percentile value of sepal length 7.7"
[1] "Standard Deviation of sepal length 0.828066127977863"
[1] "Variance value of sepal length 0.685693512304251"
Summary
• We can compute the minimum, 1st quartile, median, mean, 3rd quartile and maximum for all numeric
variables of a dataset at once using summary(). For a factor such as Species, summary() reports the count per level:
summary(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
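summary() also works on a single numeric vector, and the individual statistics can then be extracted by name. A brief sketch, assuming dat is a copy of iris; the name s is only illustrative:

```r
dat <- iris
s <- summary(dat$Sepal.Length)   # summary of a single numeric vector
s
names(s)        # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
s[["Median"]]   # extract one statistic by name
```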
Coefficient of variation
The coefficient of variation (CV) is a statistical measure of the relative dispersion of data
points in a data series around the mean. It is computed as the standard deviation divided by the mean:
sd(dat$Sepal.Length) / mean(dat$Sepal.Length)
## [1] 0.1417113
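Base R has no dedicated function for the coefficient of variation, so it can be convenient to wrap the expression in a small helper. The function name cv below is hypothetical and not part of base R:

```r
# 'cv' is a hypothetical helper, not a base R function
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(iris$Sepal.Length)         # same value as the expression above
cv(iris$Sepal.Length) * 100   # expressed as a percentage
```

Because the CV is a ratio of two quantities in the same unit, it is unit-free, which makes it suitable for comparing the spread of variables measured on different scales.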
Refer to the snow data that is given below.
year snowcover
1970 6.5
1971 12.0
1972 14.9
1973 10.0
1974 10.7
1975 7.9
1976 21.9
1977 12.5
1978 14.5
1979 9.2
(a) Create a data frame named snow using the above data.
(b) Display the last two rows of the data frame snow.
(c) Display the mean, median, first quartile, third quartile, interquartile range, 4th decile and coefficient of
variation of snowcover.
(d) Find the mean of the snowcover (i) for the odd-numbered years and (ii) for the even-numbered years.
year<-seq(1970,1979)
snowcover<-c(6.5,12.0,14.9,10.0,10.7,7.9,21.9,12.5,14.5,9.2)
snow<-data.frame(year,snowcover)
print(snow)
print(tail(snow,2))
print(paste("Mean value of snowcover(rounded)",round(mean(snow$snowcover),digits=2)))
print(paste("Median value of snowcover",median(snow$snowcover)))
print(paste("Quantile value of snowcover",quantile(snow$snowcover,0.5)))
print(paste("First quartile value of snowcover",quantile(snow$snowcover,0.25)))
print(paste("Third quartile value of snowcover",quantile(snow$snowcover,0.75)))
print(paste("Inter quartile range of snowcover",IQR(snow$snowcover)))
print(paste("Fourth decile value of snowcover",quantile(snow$snowcover,0.4)))
print(paste("Coefficient of Variation",sd(snow$snowcover) / mean(snow$snowcover)))
# Select by the year itself, not by row position: row 1 is 1970, so even row positions hold odd years
print(paste("Mean of snowcover of even-numbered years",mean(snow$snowcover[snow$year %% 2 == 0])))
print(paste("Mean of snowcover of odd-numbered years",mean(snow$snowcover[snow$year %% 2 == 1])))
Output:
year snowcover
1 1970 6.5
2 1971 12.0
3 1972 14.9
4 1973 10.0
5 1974 10.7
6 1975 7.9
7 1976 21.9
8 1977 12.5
9 1978 14.5
10 1979 9.2
year snowcover
9 1978 14.5
10 1979 9.2
[1] "Mean value of snowcover(rounded) 12.01"
[1] "Median value of snowcover 11.35"
[1] "Quantile value of snowcover 11.35"
[1] "First quartile value of snowcover 9.4"
[1] "Third quartile value of snowcover 14"
[1] "Inter quartile range of snowcover 4.6"
[1] "Fourth decile value of snowcover 10.42"
[1] "Coefficient of Variation 0.365592048382383"
[1] "Mean of snowcover of even-numbered years 13.7"
[1] "Mean of snowcover of odd-numbered years 10.32"
Write a program to load the built-in R dataset "mtcars", which has the following structure:
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
i) Display the structure of the mtcars
ii) Display the first 3 rows of mtcars
iii) Extract a data frame named mtcars6 that holds only the information for cars with 6 cylinders.
iv) Display the records of cars having 5 gears
v) Find the mean of mpg for cars having 5 gears
cat("The structure of mtcars","\n") # "\n" keeps the str() output on its own line
str(mtcars)
cat("The first three rows are","\n")
print(head(mtcars,3))
#extract a data frame mtcars6 that holds only the information for cars with 6 cylinders
mtcars6<-mtcars[mtcars$cyl==6,]
cat("Details of cars having 6 cylinders","\n")
print(mtcars6)
#Display the records of cars having 5 gears
cat("Details of cars having 5 gears","\n")
mtcar1<-mtcars[mtcars$gear==5,]
print(mtcar1)
#Find the mean of mpg for cars having 5 gears
cat("Mean value of mpg for cars having 5 gears",mean(mtcars$mpg[mtcars$gear==5]),"\n")
Output:
The structure of mtcars
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The first three rows are
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Details of cars having 6 cylinders
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Details of cars having 5 gears
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Mean value of mpg for cars having 5 gears 21.38
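The row filters in the program above can also be written with subset(), which avoids repeating mtcars$ inside the condition, and tapply() gives the mean mpg for every gear count in one call. This is only an alternative sketch; the name gear5 is illustrative:

```r
# subset() expresses the same row filters without repeating 'mtcars$'
mtcars6 <- subset(mtcars, cyl == 6)
gear5   <- subset(mtcars, gear == 5)   # 'gear5' is an illustrative name
mean(gear5$mpg)

# Mean mpg for every gear count at once
tapply(mtcars$mpg, mtcars$gear, mean)
```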
Write a program to load the built-in R dataset "airquality", which has the following structure:
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
(a) Display the first 6 rows of the airquality dataset
(b) Determine, for each of the columns of the data frame airquality (datasets package), the median, mean,
upper and lower quartiles, and range;
(c) Display the mean value of the Solar.R column
(d) Display the row or rows for which Ozone has its maximum value;
(e) Display the vector of values of Wind for values of Ozone that are above the upper quartile.
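print(head(airquality))
print(summary(airquality))
# Solar.R contains NA values, so they must be dropped before averaging
print(mean(airquality$Solar.R, na.rm = TRUE))
# which() discards the NA comparisons, keeping only the row(s) at the maximum
print(airquality[which(airquality$Ozone == max(airquality$Ozone, na.rm = TRUE)), ])
# Wind values for the days whose Ozone exceeds the upper quartile
print(airquality$Wind[which(airquality$Ozone > quantile(airquality$Ozone, 0.75, na.rm = TRUE))])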
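The airquality structure above shows NA entries in the Ozone and solar radiation columns. Functions such as mean() and max() return NA when any input is missing unless na.rm = TRUE is set, which is worth checking before answering parts (c) to (e). A short sketch of that check:

```r
colSums(is.na(airquality))            # number of NA values per column
max(airquality$Ozone)                 # NA, because Ozone has missing values
max(airquality$Ozone, na.rm = TRUE)   # 168
which.max(airquality$Ozone)           # row index of the maximum; NAs are ignored
```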
To compute the standard deviation (or variance) of multiple variables at the same
time, use lapply() with the appropriate function (sd or var) as the second argument:
lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377
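lapply() always returns a list. When a compact named vector is preferred, sapply() is a common alternative that simplifies the result where possible. A minimal sketch, assuming dat is a copy of iris; the name sds is illustrative:

```r
dat <- iris
sds <- sapply(dat[, 1:4], sd)   # named numeric vector, one element per column
round(sds, 3)
```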
Note: If you need these descriptive statistics by group, use the by() function:
by(dat, dat$Species, summary)
## dat$Species: setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
## ------------------------------------------------------------
## dat$Species: versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## ------------------------------------------------------------
## dat$Species: virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
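When only one statistic per group is needed, by() with summary can be more output than necessary. Two base R alternatives are sketched below, assuming dat is a copy of iris; the name group_means is illustrative:

```r
dat <- iris

# Mean sepal length per species (tapply returns a named one-dimensional array)
group_means <- tapply(dat$Sepal.Length, dat$Species, mean)
group_means

# Mean of every numeric column per species (aggregate returns a data frame)
aggregate(. ~ Species, data = dat, FUN = mean)
```

The values returned by tapply() match the Mean entries for Sepal.Length in the by() output above.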