Unit 4
Descriptive statistics in the R programming language describe data with the help of representative summaries such as charts,
graphs, tables, and Excel files. In descriptive analysis, we summarize the data and present it in a meaningful way so
that it can be easily understood.
It is most often performed on small data sets, and it helps us anticipate future trends based on the current
findings. The measures used to describe a data set are measures of central tendency and measures of variability or
dispersion.
Process of Descriptive Statistics in R
The measure of central tendency
Measure of variability
Measure of variability
In descriptive statistics, a measure of variability describes the spread of the data, that is, how widely the values are distributed. The most
common variability measures are:
Range
Variance
Standard deviation
Need of Descriptive Statistics in R
Descriptive analysis helps us understand our data and is an important first step in Machine Learning. Machine Learning
is about making predictions, whereas statistics is about drawing conclusions from data, which is a necessary initial
step for Machine Learning. Let’s do this descriptive analysis in R.
Descriptive Analysis in R
Descriptive analysis consists of describing the data simply, using summary statistics and graphics. Here, we’ll describe how to
compute summary statistics using R.
Import your data into R:
Before doing any computation, we first need to prepare our data: save it in an external .txt or .csv file, preferably
in the current working directory. After that, import your data into R as follows:
R
# R program to illustrate
# Descriptive Analysis
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)
print(head(myData))
Output:
  Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1   TM195  18   Male        14        Single     3       4  29562   112
2   TM195  19   Male        15        Single     2       3  31836    75
3   TM195  19 Female        14     Partnered     4       3  30699    66
4   TM195  19   Male        12        Single     3       3  32973    85
5   TM195  20   Male        13     Partnered     4       2  35247    47
6   TM195  20 Female        14     Partnered     3       3  32973    66
R functions for computing descriptive analysis:
Histogram of Age Distribution
R
library(ggplot2)
ggplot(myData, aes(x = Age)) + geom_histogram(binwidth = 2, fill = "#008080", color = "lightgray")
Output:
This code uses the ggplot2 library to create a histogram of the ‘Age’ variable from the ‘myData’ dataset. The histogram bins have a width of 2, and the
bars are filled with a teal color with a light gray border. The resulting visualization shows the distribution of ages in the dataset.
Boxplot of Miles by Gender
R
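ggplot(myData, aes(x = Gender, y = Miles)) +
geom_boxplot() +
labs(title = "Miles Distribution by Gender", x = "Gender", y = "Miles") +
theme_minimal()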
Output:
We create a boxplot visualizing the distribution of ‘Miles’ run, segmented by ‘Gender’ from the ‘myData’ dataset. Each boxplot represents
the interquartile range (IQR) of Miles for each gender. The plot is titled “Miles Distribution by Gender,” with ‘Gender’ on the x-axis and
‘Miles’ on the y-axis. The plot is styled with a minimal theme.
Bar Chart of Education Levels
R
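ggplot(myData, aes(x = factor(Education))) +
geom_bar() +
labs(title = "Education Distribution", x = "Education Level", y = "Count") +
theme_minimal()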
Output:
We generate a bar chart illustrating the distribution of ‘Education’ levels from the ‘myData’ dataset. Each bar represents the count of
observations for each education level. The chart is titled “Education Distribution,” with ‘Education Level’ on the x-axis and ‘Count’ on the
y-axis. The visualization adopts a minimal theme for a clean and simple presentation.
Mean
It is the sum of observations divided by the total number of observations. It is also defined as average which is the sum divided by count.
# R program to illustrate
# Descriptive Analysis
myData <- read.csv("CardioGoodFitness.csv",
stringsAsFactors = FALSE)
# Use a name that does not mask the built-in mean() function
mean_age <- mean(myData$Age)
print(mean_age)
Output:
[1] 28.78889
Median
It is the middle value of the data set. It splits the data into two halves. If the number of elements in the data set is odd then the center
element is median and if it is even then the median would be the average of two central elements.
# R program to illustrate
# Descriptive Analysis
myData <- read.csv("CardioGoodFitness.csv",
stringsAsFactors = FALSE)
# Use a name that does not mask the built-in median() function
median_age <- median(myData$Age)
print(median_age)
Output:
[1] 26
Mode
It is the value that has the highest frequency in the given data set. The data set may have no mode if the frequency of all data points is
the same. Also, we can have more than one mode if we encounter two or more data points having the same frequency.
R
# R program to illustrate
# Descriptive Analysis
library(modeest)
myData <- read.csv("CardioGoodFitness.csv",
stringsAsFactors = FALSE)
# Compute the mode value (mfv = most frequent value)
age_mode <- mfv(myData$Age)
print(age_mode)
Output:
[1] 25
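The case of more than one mode can be checked in base R without any extra package. The following is a minimal sketch, assuming a small made-up vector (the values and the names `values`, `freq` and `modes` are hypothetical) in which two values share the highest frequency:

```r
# Hypothetical values in which 2 and 5 both occur twice
values <- c(2, 5, 2, 7, 5, 9)

# Count how often each distinct value occurs
freq <- table(values)

# Keep every value whose count equals the maximum count
modes <- as.numeric(names(freq)[freq == max(freq)])
print(modes)
```

Because every value tied for the highest count is kept, the result is a vector that may hold one mode or several.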
Range
The range describes the difference between the largest and smallest data point in our data set. The bigger the range, the more is the
spread of data and vice versa.
Range = Largest data value – smallest data value
# R program to illustrate
# Descriptive Analysis
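myData <- read.csv("CardioGoodFitness.csv",
stringsAsFactors = FALSE)
# Use names that do not mask the built-in max(), min() and range() functions
max_age <- max(myData$Age)
min_age <- min(myData$Age)
# Range = largest value - smallest value
age_range <- max_age - min_age
cat("Range is:\n")
print(age_range)
# range() returns the smallest and largest values
r <- range(myData$Age)
print(r)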
Output:
Range is:
[1] 32
[1] 18 50
Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point
and the mean, squaring the differences, adding them up, and then dividing by the number of data points
in the data set:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where,
N = number of terms
μ = Mean
Note that R's var() function computes the sample variance, which divides by N − 1 instead of N.
R
# R program to illustrate
# Descriptive Analysis
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Calculating variance
variance = var(myData$Age)
print(variance)
Output:
[1] 48.21217
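The difference between dividing by the number of data points and what `var()` actually returns can be seen on a small example. This is only a sketch: the vector `x` is made up for illustration, and `pop_var` and `sample_var` are hypothetical names.

```r
# Hypothetical sample, used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Population variance: sum of squared deviations divided by n
pop_var <- sum((x - mean(x))^2) / n

# Sample variance: what var() returns, dividing by n - 1
sample_var <- var(x)

print(pop_var)     # 4
print(sample_var)  # 32 / 7, about 4.571
```

For large data sets the two values are close, but for small samples the choice of divisor matters.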
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each value
and squaring the result, adding all the squared values, dividing by the number of terms, and then taking the square root:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
where,
N = number of terms
μ = Mean
R's sd() function returns the sample standard deviation, which divides by N − 1 instead of N.
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
std = sd(myData$Age)
print(std)
Output:
[1] 6.943498
Normal Distribution in R
Last Updated : 13 Apr, 2020
Normal distribution is a probability distribution used in statistics that describes how data values are distributed. It is the most
important probability distribution in statistics because many real-world quantities approximately follow it, for example the heights of
a population, shoe sizes, IQ scores, and the sum of many dice rolls. Data are often approximately normal when they are
collected at random from independent sources. Plotting the values of the variable on the x-axis and their counts on
the y-axis gives a bell-shaped curve. The peak of the curve is at the mean of the data set; half of the values
lie to the left of the mean and the other half lie to the right, so the
distribution is symmetric. In R, there are 4 built-in functions for working with the normal distribution:
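dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
where,
– x represents the data set of values (p is a vector of probabilities and n is the number of values to generate).
– mean represents the mean of the distribution. Its default value is 0.
– sd represents the standard deviation of the distribution. Its default value is 1.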
dnorm()
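The dnorm() function in R computes the probability density function of the normal distribution. In statistics, it is given by the formula
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation.
Syntax:
dnorm(x, mean, sd)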
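# Assumed input: x from -10 to 10, standard normal (mean 0, sd 1)
x <- seq(-10, 10, by = 0.1)
y <- dnorm(x, mean = 0, sd = 1)
png(file="dnormExample.png")
plot(x, y)
dev.off()
Output: the plot is saved to dnormExample.png.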
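To see that dnorm() evaluates the normal density formula, the density can be computed by hand and compared with the built-in result. This is a minimal sketch: the evaluation point `x` and the parameters `mu` and `sigma` are assumed values chosen only for illustration.

```r
# Assumed values, for illustration only
x <- 1.5     # point at which the density is evaluated
mu <- 0      # mean
sigma <- 2   # standard deviation

# Normal density written out term by term
manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# The same density from the built-in function
builtin <- dnorm(x, mean = mu, sd = sigma)

print(manual)
print(builtin)
print(all.equal(manual, builtin))
```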
pnorm()
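The pnorm() function is the cumulative distribution function, which measures the probability that a random variable X takes a value less
than or equal to x. In statistics it is given by
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
where f is the normal density function.
Syntax:
pnorm(x, mean, sd)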
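# Assumed input: x from -10 to 10, standard normal (mean 0, sd 1)
x <- seq(-10, 10, by = 0.1)
y <- pnorm(x, mean = 0, sd = 1)
png(file="pnormExample.png")
# Plot the graph.
plot(x, y)
dev.off()
Output: the plot is saved to pnormExample.png.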
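A common use of pnorm() is finding the probability that a value falls inside an interval, as the difference of two cumulative probabilities. The sketch below does this for the standard normal distribution; the names `k` and `within_k_sd` are hypothetical.

```r
# Number of standard deviations around the mean
k <- 1:3

# P(-k <= Z <= k) for a standard normal Z, as a difference of two CDF values
within_k_sd <- pnorm(k) - pnorm(-k)

print(round(within_k_sd, 4))  # 0.6827 0.9545 0.9973
```

These are the familiar 68-95-99.7 percentages for one, two and three standard deviations.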
qnorm()
The qnorm() function is the inverse of the pnorm() function. It takes a probability value and returns the value x whose cumulative
probability equals that value. It is useful for finding the percentiles of a normal distribution.
Syntax:
qnorm(p, mean, sd)
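# Create a sequence of probability values (assumed to run from 0 to 1),
# incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Quantiles of the standard normal (assumed mean 0, sd 1)
y <- qnorm(x, mean = 0, sd = 1)
png(file = "qnormExample.png")
plot(x, y)
dev.off()
Output: the plot is saved to qnormExample.png.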
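The inverse relationship between qnorm() and pnorm() can be checked directly. This is a small sketch for the standard normal distribution; the probabilities in `p` are arbitrary example values.

```r
# Example probabilities (arbitrary choices)
p <- c(0.025, 0.5, 0.975)

# Quantiles of the standard normal for those probabilities
q <- qnorm(p)
print(round(q, 2))  # -1.96  0.00  1.96

# Feeding the quantiles back into pnorm() recovers the probabilities
p_back <- pnorm(q)
print(p_back)
```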
rnorm()
The rnorm() function in R generates a vector of normally distributed random numbers.
Syntax:
rnorm(n, mean, sd)
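# Assumed sample: 10000 draws from the standard normal (mean 0, sd 1)
x <- rnorm(10000, mean = 0, sd = 1)
png(file = "rnormExample.png")
hist(x, breaks=50)
dev.off()
Output: the histogram is saved to rnormExample.png.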
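Because rnorm() is random, two runs normally give different numbers. The sketch below shows one way to make the draws reproducible with set.seed(), and checks that the mean of a large sample is close to the requested mean. The seed, sample sizes and parameters are arbitrary assumptions.

```r
# Same seed, same draws (the seed value 42 is arbitrary)
set.seed(42)
x1 <- rnorm(5)
set.seed(42)
x2 <- rnorm(5)
print(identical(x1, x2))

# A large sample with assumed mean 10 and standard deviation 2
big <- rnorm(10000, mean = 10, sd = 2)
print(mean(big))  # close to 10
print(sd(big))    # close to 2
```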
Binomial Distribution in R
Binomial distribution in R is a probability distribution used in statistics. It is a discrete distribution in which each trial has only
two outcomes, success or failure. All trials are independent: the probability of success remains the same from trial to trial, and a previous outcome
does not affect the next one. The binomial distribution helps us find individual
probabilities as well as cumulative probabilities over a certain range.
It is used in many real-life scenarios, such as determining whether a particular lottery ticket has won or not, whether a drug is able
to cure a person or not, the number of heads or tails in a finite number of coin tosses, or how often a particular face
comes up when a die is rolled repeatedly.
Formula:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
dbinom()
dbinom(k, n, p)
pbinom()
pbinom(k, n, p)
where n is total number of trials, p is probability of success, k is the value at which the probability has to be found out.
qbinom()
qbinom(P, n, p)
Where P is the probability, n is the total number of trials and p is the probability of success.
rbinom()
rbinom(n, N, p)
Where n is the number of observations, N is the total number of trials, and p is the probability of success.
dbinom() Function
This function is used to find the probability at a particular value for data that follows a binomial distribution, i.e., it finds:
P(X = k)
Syntax:
dbinom(k, n, p)
Example:
dbinom(3, size = 13, prob = 1/6)
data.frame(probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6))
Output :
> dbinom(3, size = 13, prob = 1/6)
[1] 0.2138454
> probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6)
> data.frame(probabilities)
probabilities
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
5 5.426588e-02
6 1.302381e-02
7 2.170635e-03
8 2.480726e-04
9 1.860544e-05
10 8.269086e-07
11 1.653817e-08
The above piece of code first finds the probability at k=3, then it displays a data frame containing the probability distribution for k from 0
to 10 which in this case is 0 to n.
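The value returned by dbinom() can be reproduced from the binomial probability formula using choose(). This is a minimal sketch that reuses the numbers from the example above (k = 3, n = 13, p = 1/6); `manual` and `builtin` are hypothetical names.

```r
k <- 3       # number of successes
n <- 13      # number of trials
p <- 1 / 6   # probability of success in each trial

# Binomial probability written out: C(n, k) * p^k * (1 - p)^(n - k)
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probability from the built-in function
builtin <- dbinom(k, size = n, prob = p)

print(manual)
print(builtin)
```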
pbinom() Function
The function pbinom() is used to find the cumulative probability of data following a binomial distribution up to a given value, i.e., it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
Example:
pbinom(3, size = 13, prob = 1/6)
Output :
> pbinom(3, size = 13, prob = 1/6)
[1] 0.8419226
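The cumulative probability is simply the sum of the individual probabilities up to k. The sketch below checks this for the same example and also shows the upper tail P(X > k) through the `lower.tail` argument; the variable names are hypothetical.

```r
# Sum of P(X = 0), ..., P(X = 3)
cumulative <- sum(dbinom(0:3, size = 13, prob = 1/6))

# The same value from pbinom()
builtin <- pbinom(3, size = 13, prob = 1/6)

# Upper tail P(X > 3)
upper <- pbinom(3, size = 13, prob = 1/6, lower.tail = FALSE)

print(cumulative)
print(builtin)
print(upper)
```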
qbinom() Function
This function is used to find a quantile of the binomial distribution: if the cumulative probability P(X <= k) is given, it finds k.
Syntax:
qbinom(P, n, p)
Example:
qbinom(0.8419226, size = 13, prob = 1/6)
Output :
> qbinom(0.8419226, size = 13, prob = 1/6)
[1] 3
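Since the binomial distribution is discrete, a given probability usually does not correspond exactly to any k. In that case qbinom() returns the smallest k whose cumulative probability reaches the requested value. The sketch below illustrates this with an assumed probability of 0.5 (the value of `P` is chosen only for the example):

```r
P <- 0.5  # assumed target cumulative probability

# Smallest k with P(X <= k) >= P
k <- qbinom(P, size = 13, prob = 1/6)
print(k)  # 2

# The cumulative probability at k reaches P, the one at k - 1 does not
print(pbinom(k, size = 13, prob = 1/6))
print(pbinom(k - 1, size = 13, prob = 1/6))
```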
rbinom() Function
This function generates n random values from a binomial distribution with the given number of trials and probability of success.
Syntax:
rbinom(n, N, p)
Example:
rbinom(8, size = 13, prob = 1/6)
Output:
> rbinom(8, size = 13, prob = 1/6)
[1] 1 1 2 1 4 0 2 3
Poisson Distribution In R
Last Updated : 02 Feb, 2024
Poisson distribution is a probability distribution that expresses the number of events occurring in a fixed interval of time or space, given a
constant average rate. This distribution is particularly useful when dealing with rare events or incidents that happen independently. R
provides powerful tools for statistical analysis, making it an excellent choice for working with probability distributions like Poisson.
Poisson Distribution
If λ is the mean number of occurrences per interval, then the probability of having x occurrences within a given interval is:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$
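The Poisson probability can be evaluated by hand and compared with the built-in dpois() function. This is a minimal sketch; the values of `x` and `lambda` are assumed for illustration only.

```r
x <- 3        # number of occurrences (assumed)
lambda <- 2   # mean number of occurrences per interval (assumed)

# Poisson probability written out: lambda^x * exp(-lambda) / x!
manual <- lambda^x * exp(-lambda) / factorial(x)

# The same probability from the built-in function
builtin <- dpois(x, lambda)

print(manual)
print(builtin)
```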
R
# Probability of having thirty or fewer inquiries
# (lambda = 20 is inferred from the output below; the original line is missing)
probability_30_or_less <- ppois(30, lambda = 20)
print(probability_30_or_less)
# Probability of having more than thirty inquiries
probability_more_than_30 <- 1 - probability_30_or_less
print(probability_more_than_30)
Output:
[1] 0.9865253
[1] 0.01347468
The probability of having thirty or fewer inquiries (P(X≤30)) is approximately 98.65%.
The probability of having more than thirty inquiries (P(X>30)) is approximately 1.35%.
This means that, in a minute, there is a high likelihood (98.65%) that the number of customer inquiries will be thirty or fewer, and a
low likelihood (1.35%) that it will be more than thirty, based on the given average rate.
Characteristics of Poisson distribution
1. Events Occur Independently: Poisson distribution assumes that events occur independently of each other. This means the
occurrence of one event does not affect the occurrence of another.
2. Constant Average Rate: The events happen at a constant average rate over a fixed interval of time or space.
3. Discrete Nature: The distribution is discrete, meaning it deals with whole numbers (0, 1, 2, …) as it represents the count of events.
Poisson Functions in R Programming
R provides several built-in functions to work with the Poisson distribution. The key functions are `dpois()`, `ppois()`, `qpois()`, and `rpois()`,
which correspond to the probability mass function (PMF), cumulative distribution function (CDF), quantile function, and random number
generation, respectively.
1. dpois(x, lambda)
This function calculates the probability mass function (PMF) of the Poisson distribution.
It gives the probability of observing exactly `x` events in a Poisson distribution with mean (`lambda`).
R
x <- 3
lambda <- 2
probability <- dpois(x, lambda)
print(probability)
Output:
[1] 0.180447
2. ppois(q, lambda)
This function calculates the cumulative distribution function (CDF) of the Poisson distribution.
It gives the probability of observing fewer than or equal to `q` events.
R
q <- 2
lambda <- 3
cumulative_probability <- ppois(q, lambda)
print(cumulative_probability)
Output:
[1] 0.4231901
3. qpois(p, lambda)
This function calculates the quantile function of the Poisson distribution.
It returns the smallest integer `q` such that `ppois(q, lambda)` is greater than or equal to `p`.
R
p <- 0.8
lambda <- 4
quantile_value <- qpois(p, lambda)
print(quantile_value)
Output:
[1] 6
4. rpois(n, lambda)
This function generates random samples from a Poisson distribution.
It produces `n` random values representing the count of events, where the mean is specified by `lambda`.
R
n <- 10
lambda <- 5
random_samples <- rpois(n, lambda)
print(random_samples)
Output:
[1] 8 3 9 6 4 4 2 4 2 8
These functions are part of the base R package and are helpful for performing various operations related to the Poisson distribution.
R - Linear Regression
Regression analysis is a widely used statistical tool for establishing a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other is called the response variable, whose value is
derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is
not equal to 1, creates a curve.
The steps to build the model, using height and weight as the example, are:
1. Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
2. Create a relationship model using the lm() function in R.
3. Find the coefficients from the model created and write the mathematical equation
using them.
4. Get a summary of the relationship model to know the average error in prediction,
also called residuals.
5. To predict the weight of new persons, use the predict() function in R.
Input Data
# Values of height.
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
lm(formula,data)
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Fit the linear model of weight (y) on height (x).
relation <- lm(y~x)
print(relation)
Coefficients:
(Intercept) x
-38.4551 0.6746
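The coefficients define the equation of the fitted line, weight = intercept + slope × height. The sketch below extracts them with coef() and checks that applying the equation by hand gives the same fitted values as the model; `b` and `manual_fit` are hypothetical names.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # heights
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weights

relation <- lm(y ~ x)

# Named vector holding the intercept and the slope
b <- coef(relation)
print(b)

# Apply the equation y = intercept + slope * x by hand
manual_fit <- b[["(Intercept)"]] + b[["x"]] * x

# Compare with the fitted values stored in the model
print(all.equal(manual_fit, unname(fitted(relation))))
```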
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
print(summary(relation))
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
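Individual parts of the summary can also be extracted as numbers instead of being read from the printed table. The following sketch pulls out the residuals, the residual standard error and R-squared; the names `fit_summary`, `res`, `rse`, `manual_rse` and `r2` are hypothetical.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
fit_summary <- summary(relation)

# Residuals: observed weight minus fitted weight
res <- residuals(relation)
print(round(range(res), 4))

# Residual standard error, from the summary and computed by hand
rse <- fit_summary$sigma
manual_rse <- sqrt(sum(res^2) / df.residual(relation))
print(c(rse, manual_rse))

# Proportion of the variance in y explained by x
r2 <- fit_summary$r.squared
print(r2)
```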
predict() Function
Syntax
predict(object, newdata)
object is the model which is already created using the lm() function.
newdata is a data frame containing the new values of the predictor variable.
Predict the weight of new persons
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
# Predict the weight of a person with height 170 (an illustrative value).
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
Mean
It is calculated by taking the sum of the values and dividing by
the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
mean(x, trim = 0, na.rm = FALSE, ...)
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When trim = 0.3, 3 values from each end of the sorted vector are dropped
before the mean is calculated.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18,
54), and the values removed are (−21, −5, 2) from the left and
(12, 18, 54) from the right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)
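The trimmed mean can be reproduced by hand by sorting the vector, dropping the values at both ends and averaging the rest. This is a minimal sketch of that calculation; `sorted`, `k`, `kept`, `manual` and `builtin` are hypothetical names.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# Number of values dropped from each end: floor(10 * 0.3) = 3
sorted <- sort(x)
k <- floor(length(x) * 0.3)

# Values that remain after trimming: 3, 4.2, 7, 8
kept <- sorted[(k + 1):(length(x) - k)]
manual <- mean(kept)

# The same result from the trim argument
builtin <- mean(x, trim = 0.3)

print(manual)   # 5.55
print(builtin)  # 5.55
```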
Applying NA Option
If there are missing values, then the mean function returns NA.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
result.mean <- mean(x)
print(result.mean)
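To leave the missing values out of the calculation, the `na.rm` argument of mean() can be set to TRUE. The sketch below shows both behaviours on the same vector; `with_na` and `without_na` are hypothetical names.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5, NA)

# With the default na.rm = FALSE the result is NA
with_na <- mean(x)
print(with_na)

# With na.rm = TRUE the NA is dropped before averaging
without_na <- mean(x, na.rm = TRUE)
print(without_na)  # 8.22
```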
Median
The middlemost value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
median(x, na.rm = FALSE)
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
print(median(x))
Mode
The mode is the value that has the highest number of occurrences in
a set of data. Unlike mean and median, mode can be computed for both
numeric and character data.
R has no built-in function for the statistical mode, so we create a user-defined function.
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
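A short usage sketch for the function follows. The two input vectors are made up for illustration; note that when several values are tied for the highest count, which.max() picks the first of them, so only one mode is returned.

```r
# The function from above, repeated so that this example runs on its own
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical numeric data: 2 occurs three times
v <- c(2, 1, 2, 3, 1, 2, 3)
num_mode <- getmode(v)
print(num_mode)  # 2

# Hypothetical character data: "it" occurs three times
charv <- c("o", "it", "the", "it", "it")
char_mode <- getmode(charv)
print(char_mode)  # "it"
```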
R - Binomial Distribution
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
size is the number of trials.
prob is the probability of success of each trial.
dbinom()
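This function gives the probability of each value (the probability
mass function).
# Create a sequence of 51 numbers from 0 to 50, incremented by 1.
x <- seq(0,50,by = 1)
# Probability of each value, assuming 50 trials with success probability 0.5.
y <- dbinom(x,50,0.5)
plot(x,y)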
pbinom()
This function gives the cumulative probability of an event. It is a
single value representing the probability.
# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)
print(x)
qbinom()
This function takes the probability value and gives a number
whose cumulative value matches the probability value.
# How many heads have a cumulative probability of 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
rbinom()
This function generates the required number of random values, each being the
number of successes in a given number of trials with a given probability.
# Find 8 random values from a sample of 150 with probability of 0.4.
x <- rbinom(8,150,.4)
print(x)
When we execute the above code, it produces the following result (the values vary between runs because they are random) −
[1] 58 61 59 66 55 60 61 67
R - Poisson Regression
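Input Data
input <- warpbreaks
print(head(input))
# Poisson model inferred from the coefficient table below (breaks ~ wool + tension).
print(summary(glm(breaks ~ wool + tension, data = input, family = poisson)))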
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
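Poisson regression models the logarithm of the expected count, so the coefficients are easier to read after exponentiating them. The sketch below assumes the model is `breaks ~ wool + tension` on the built-in `warpbreaks` data (the formula is inferred from the coefficient names in the table above); `output` and `rate_ratios` are hypothetical names.

```r
# Fit the Poisson regression (formula assumed from the coefficient table)
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Coefficients on the log scale
print(round(coef(output), 5))

# Exponentiated coefficients: multiplicative change in the expected count
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 4))
```

A ratio below 1, as for woolB and both tension levels here, means a lower expected number of breaks than for the reference level.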