Descriptive Analysis in R Programming
Descriptive statistics in the R programming language summarize a data set using representative
measures and present it through charts, graphs, and tables. The goal of descriptive analysis is to
describe the data in a meaningful way so that it can be easily understood.
It is often performed on small data sets, and its findings can help anticipate future trends.
The measures used to describe a data set fall into two groups: measures of central tendency and
measures of variability (dispersion).
Process of Descriptive Statistics in R
The measure of central tendency
Measure of variability
Measure of central tendency
It represents the whole set of data by a single value. It gives us the location of central points.
There are three main measures of central tendency:
Mean
Mode
Median
Measure of variability
In descriptive statistics, a measure of variability describes the spread of the data, that is, how
widely the values are distributed. The most common variability measures are:
Range
Variance
Standard deviation
Need for Descriptive Statistics in R
Descriptive analysis helps us understand our data and is an important part of Machine Learning.
Machine Learning is about making predictions, whereas statistics is about drawing conclusions
from data, which is a necessary first step for Machine Learning. Let’s carry out this descriptive
analysis in R.
Descriptive Analysis in R
Descriptive analysis consists of describing the data simply, using summary statistics and
graphics. Here, we’ll describe how to compute summary statistics using R.
Import your data into R:
Before doing any computation, we need to prepare the data: save it in an external .txt or .csv
file, ideally in the current working directory. After that, import the data into R as follows:
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Print the first 6 rows
print(head(myData))
Output:
  Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1   TM195  18   Male        14        Single     3       4  29562   112
2   TM195  19   Male        15        Single     2       3  31836    75
3   TM195  19 Female        14     Partnered     4       3  30699    66
4   TM195  19   Male        12        Single     3       3  32973    85
5   TM195  20   Male        13     Partnered     4       2  35247    47
6   TM195  20 Female        14     Partnered     3       3  32973    66
R functions for computing descriptive analysis:
Histogram of Age Distribution
R
library(ggplot2)
ggplot(myData, aes(x = Age)) +
geom_histogram(binwidth = 2, fill = "blue", color = "red", alpha = 0.8) +
labs(title = "Age Distribution", x = "Age", y = "Frequency")
This code uses the ggplot2 library to create a histogram of the ‘Age’ variable from the ‘myData’
dataset. The histogram bins have a width of 2, and the bars are filled with blue and outlined in
red. The resulting visualization shows the distribution of ages in the dataset.
Boxplot of Miles by Gender
R
ggplot(myData, aes(x = Gender, y = Miles, fill = Gender)) +
geom_boxplot() +
labs(title = "Miles Distribution by Gender", x = "Gender", y = "Miles") +
theme_minimal()
Mean
It is the sum of the observations divided by the total number of observations, also known as the
average.
Mean = (x1 + x2 + ... + xn) / n
where n = number of terms
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Compute the mean value
# (stored as mean_age so the built-in mean() is not shadowed)
mean_age = mean(myData$Age)
print(mean_age)
Output:
[1] 28.78889
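To connect the definition to the built-in function, the following minimal sketch computes the mean
by hand and compares it with mean(). The vector ages holds made-up illustrative values (an
assumption, not rows from CardioGoodFitness.csv).

```r
# Hypothetical sample values, used only for illustration
ages <- c(18, 19, 19, 20, 24, 26)

# Mean by definition: sum of the observations divided by their count
manual_mean <- sum(ages) / length(ages)

# Built-in equivalent
builtin_mean <- mean(ages)

print(manual_mean)
print(isTRUE(all.equal(manual_mean, builtin_mean)))
```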
Median
It is the middle value of the sorted data set and splits the data into two halves. If the number
of elements is odd, the median is the center element; if it is even, the median is the average of
the two central elements.
Median = ((n + 1) / 2)th term, if n is odd
Median = average of the (n / 2)th and (n / 2 + 1)th terms, if n is even
where n = number of terms
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Compute the median value
# (stored as median_age so the built-in median() is not shadowed)
median_age = median(myData$Age)
print(median_age)
Output:
[1] 26
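The odd and even cases described above can be sketched as a small helper and checked against the
built-in median(). The function name manual_median and both sample vectors are hypothetical and
exist only for this illustration.

```r
# Hypothetical values for illustration
odd_set  <- c(7, 3, 9, 1, 5)       # n = 5 (odd)
even_set <- c(7, 3, 9, 1, 5, 11)   # n = 6 (even)

manual_median <- function(v) {
  s <- sort(v)                     # the median is defined on sorted data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd n: the single middle element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # even n: average of the two central elements
  }
}

print(manual_median(odd_set))
print(manual_median(even_set))
print(median(even_set))
```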
Mode
It is the value that has the highest frequency in the given data set. The data set has no mode if
all data points occur with the same frequency, and it can have more than one mode if two or more
values share the highest frequency.
R
# R program to illustrate
# Descriptive Analysis
# Import the library
library(modeest)
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Compute the mode value
# (stored as mode_age so the built-in mode() is not shadowed)
mode_age = mfv(myData$Age)
print(mode_age)
Output:
[1] 25
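The modeest package is one option. As an alternative sketch that needs no extra package, the mode
can also be found with base R's table(). The helper name find_modes and the sample values are
hypothetical, chosen so that two values tie for the highest frequency.

```r
# Hypothetical values; 25 and 30 both occur twice
ages <- c(25, 30, 25, 41, 30, 19)

find_modes <- function(v) {
  counts <- table(v)   # frequency of each distinct value
  # Keep every value whose count equals the highest count
  as.numeric(names(counts)[counts == max(counts)])
}

print(find_modes(ages))
```

If every value occurs equally often, this helper returns all of them, which corresponds to the
"no mode" case described above.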
Range
The range is the difference between the largest and smallest data points in the data set. The
bigger the range, the greater the spread of the data, and vice versa.
Range = Largest data value – smallest data value
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Calculate the maximum
# (names such as max_age avoid shadowing the built-in max(), min() and range())
max_age = max(myData$Age)
# Calculate the minimum
min_age = min(myData$Age)
# Calculate the range
age_range = max_age - min_age
cat("Range is:\n")
print(age_range)
# Alternate method to get min and max
r = range(myData$Age)
print(r)
Output:
Range is:
[1] 32
[1] 18 50
Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the
difference between every data point and the mean, squaring each difference, adding the squares,
and dividing by the number of data points in the data set.
σ² = Σ (xi − u)² / N
where,
N = number of terms
u = Mean
Note that R’s var() computes the sample variance, which divides by N − 1 instead of N.
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)
# Calculating variance
variance = var(myData$Age)
print(variance)
Output:
[1] 48.21217
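The textbook definition divides by the number of data points, while R's var() divides by one less
than that (the sample variance). The sketch below uses made-up values to show both, so the
difference is visible. The names pop_var and samp_var are introduced here for illustration.

```r
# Hypothetical values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
mu <- mean(x)

# Population variance: average squared deviation (divide by n)
pop_var <- sum((x - mu)^2) / n

# Sample variance: divide by n - 1, which is what var() returns
samp_var <- sum((x - mu)^2) / (n - 1)

print(pop_var)
print(samp_var)
print(var(x))
```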
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean,
subtracting the mean from each value and squaring the result, adding all the squared values,
dividing by the number of terms, and finally taking the square root.
σ = √( Σ (xi − u)² / N )
where,
N = number of terms
u = Mean
Note that R’s sd() computes the sample standard deviation, which divides by N − 1 instead of N.
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv", stringsAsFactors = F)
# Calculating Standard deviation
std = sd(myData$Age)
print(std)
Output:
[1] 6.943498
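As a quick check of the relationship between the two measures, this sketch confirms that sd() is
the square root of var(), and computes the divide-by-N version alongside it. The data values and
the names sample_sd and pop_sd are hypothetical.

```r
# Hypothetical values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation: square root of the sample variance
sample_sd <- sqrt(var(x))

# Population standard deviation: square root of the mean squared deviation
pop_sd <- sqrt(mean((x - mean(x))^2))

print(sample_sd)
print(sd(x))
print(pop_sd)
```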
R - Linear Regression
Regression analysis is a widely used statistical tool for establishing a relationship model between two variables.
One of these variables is called the predictor variable, whose value is gathered through experiments. The other
is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both
variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A
non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known. To do this we
need to have the relationship between the height and weight of a person.
The steps to create the relationship are −
Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
Create a relationship model using the lm() function in R.
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction, also called residuals.
To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
formula is a symbolic description of the relation between x and y.
data is the data frame containing the variables to which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
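For a single predictor, the least-squares coefficients also have a closed form: the slope is
cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below reuses the
height and weight data from above to check that this agrees with lm(). The variable names slope,
intercept and fitted_coefs are introduced here for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Closed-form least-squares estimates for y = ax + b
slope <- cov(x, y) / var(x)              # a
intercept <- mean(y) - slope * mean(x)   # b

# Coefficients as estimated by lm(), in the order (Intercept), x
fitted_coefs <- unname(coef(relation))

print(c(intercept, slope))
print(fitted_coefs)
```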
Get the Summary of the Relationship
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(summary(relation))
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
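The printed summary is convenient to read, but the individual numbers can also be pulled out of
the summary object for further use. A minimal sketch, assuming the same data as above; the names
fit_summary, res, sigma_hat and r_squared are chosen here for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
fit_summary <- summary(relation)

# Residuals: observed weight minus fitted weight, one per observation
res <- residuals(relation)

# Residual standard error and R-squared as stored in the summary object
sigma_hat <- fit_summary$sigma
r_squared <- fit_summary$r.squared

print(round(res, 4))
print(sigma_hat)
print(r_squared)
```

With one predictor, R-squared equals the squared correlation between x and y, which gives an
independent way to check the value.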
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
object is the model which has already been created using the lm() function.
newdata is the data frame containing the new values for the predictor variable.
Predict the weight of new persons
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
When we execute the above code, it produces the following result −
1
76.22869
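To see what predict() does for this simple model, the prediction can be reproduced by plugging
the new height into the fitted equation. This is a sketch under the same data as above; the names
coefs, manual_prediction and model_prediction are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Plug height 170 into the fitted line: intercept + slope * height
coefs <- coef(relation)
manual_prediction <- coefs[["(Intercept)"]] + coefs[["x"]] * 170

# Same prediction via predict(); the column name in newdata must match the predictor name x
model_prediction <- predict(relation, data.frame(x = 170))

print(manual_prediction)
print(unname(model_prediction))
```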
Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
# Give the chart file a name.
png(file = "linearregression.png")
# Plot the chart with weight on the horizontal axis and height on the vertical axis.
plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
# Add the fitted line in a separate call; it regresses height on weight to match the axes.
abline(lm(x~y))
# Save the file.
dev.off()
When we execute the above code, it saves the chart to the file linearregression.png in the working directory.
R - Normal Distribution
In a random collection of data from independent sources, it is generally observed that the distribution of data is
normal. This means that on plotting a graph with the value of the variable on the horizontal axis and the count of
the values on the vertical axis, we get a bell-shaped curve. The center of the curve represents the mean of the data
set. In the graph, fifty percent of values lie to the left of the mean and the other fifty percent lie to the right of the
mean. This is referred to as the normal distribution in statistics.
R has four built-in functions for working with the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
Following is the description of the parameters used in above functions −
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the distribution. Its default value is zero.
sd is the standard deviation. Its default value is 1.
dnorm()
This function gives height of the probability distribution at each point for a given mean and standard deviation.
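As a rough check of what "height" means here, the sketch below writes out the normal density
formula and compares it with dnorm() at a few points. The evaluation points and the names mu,
sigma, manual_density and builtin_density are chosen only for illustration.

```r
mu <- 2.5
sigma <- 0.5
x <- c(1.5, 2.5, 3.0)

# Normal density formula written out explicitly
manual_density <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Built-in equivalent
builtin_density <- dnorm(x, mean = mu, sd = sigma)

print(manual_density)
print(builtin_density)
```

The density is highest at the mean, so of these three points the largest value is at x = 2.5.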
# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
# Give the chart file a name.
png(file = "dnorm.png")
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it saves the plot of the density curve to the file dnorm.png.
pnorm()
This function gives the probability that a normally distributed random number is less than a given value. It
is also called the "Cumulative Distribution Function".
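Because pnorm() returns cumulative probabilities, interval and upper-tail probabilities follow by
subtraction. A minimal sketch, assuming a mean of 2.5 and a standard deviation of 2; the variable
names are illustrative.

```r
mu <- 2.5
sigma <- 2

# Half of the probability lies below the mean
below_mean <- pnorm(mu, mean = mu, sd = sigma)

# Probability of falling within one standard deviation of the mean
within_one_sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)

# Upper-tail probability P(X > 4), written in two equivalent ways
upper_a <- 1 - pnorm(4, mean = mu, sd = sigma)
upper_b <- pnorm(4, mean = mu, sd = sigma, lower.tail = FALSE)

print(below_mean)
print(within_one_sd)
print(upper_a)
```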
# Create a sequence of numbers between -10 and 10 incrementing by 0.2.
x <- seq(-10,10,by = .2)
# Choose the mean as 2.5 and standard deviation as 2.
y <- pnorm(x, mean = 2.5, sd = 2)
# Give the chart file a name.
png(file = "pnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it saves the plot of the cumulative distribution to the file pnorm.png.
qnorm()
This function takes the probability value and gives a number whose cumulative value matches the probability
value.
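Put differently, qnorm() is the inverse of pnorm(). The sketch below checks the round trip for a
few probabilities, assuming a mean of 2 and a standard deviation of 1; the names p, q, round_trip
and ends are illustrative.

```r
p <- c(0.025, 0.5, 0.975)

# Quantiles for the given probabilities
q <- qnorm(p, mean = 2, sd = 1)

# Feeding the quantiles back into pnorm() recovers the probabilities
round_trip <- pnorm(q, mean = 2, sd = 1)

# The extreme probabilities 0 and 1 map to infinite quantiles
ends <- qnorm(c(0, 1))

print(q)
print(round_trip)
print(ends)
```

Since probabilities of exactly 0 and 1 give -Inf and Inf, those two points cannot be drawn on a
finite axis when a sequence from 0 to 1 is plotted.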
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
# Choose the mean as 2 and standard deviation as 1.
y <- qnorm(x, mean = 2, sd = 1)
# Give the chart file a name.
png(file = "qnorm.png")
# Plot the graph.
plot(x,y)
# Save the file.
dev.off()
When we execute the above code, it saves the plot of the quantiles to the file qnorm.png.
rnorm()
This function is used to generate random numbers whose distribution is normal. It takes the sample size as input
and generates that many random numbers. We draw a histogram to show the distribution of the generated
numbers.
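rnorm() also accepts the mean and sd arguments, and fixing the random seed with set.seed() makes
a run repeatable. A minimal sketch; the seed value 123, the sample size and the parameter values
are arbitrary choices for illustration.

```r
# Fix the seed so the same numbers are generated on each run
set.seed(123)
sample_a <- rnorm(1000, mean = 10, sd = 2)

# Resetting the same seed reproduces the same sample
set.seed(123)
sample_b <- rnorm(1000, mean = 10, sd = 2)

print(identical(sample_a, sample_b))
print(mean(sample_a))   # close to 10, but not exactly 10
print(sd(sample_a))     # close to 2, but not exactly 2
```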
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
# Give the chart file a name.
png(file = "rnorm.png")
# Plot the histogram for this sample.
hist(y, main = "Normal Distribution")
# Save the file.
dev.off()
When we execute the above code, it saves the histogram to the file rnorm.png.
Binomial Distribution in R Programming
Binomial distribution in R is a probability distribution used in statistics. The binomial
distribution is a discrete distribution in which each trial has only two outcomes, i.e. success or
failure. All its trials are independent: the probability of success remains the same and the
previous outcome does not affect the next outcome. The binomial distribution helps us to find the
individual probabilities as well as cumulative probabilities over a certain range.
It is also used in many real-life scenarios, such as determining whether a particular lottery
ticket has won or not, whether a drug is able to cure a person or not, the number of heads or
tails in a finite number of coin tosses, or the outcome of a die roll (for example, six or not six).
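Formula:
P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)
where n is the total number of trials, p is the probability of success, k is the number of
successes, and C(n, k) is the number of ways to choose k trials out of n.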
Functions for Binomial Distribution
We have four functions for handling binomial distribution in R namely:
dbinom()
dbinom(k, n, p)
pbinom()
pbinom(k, n, p)
where n is the total number of trials, p is the probability of success, and k is the value at
which the probability has to be found.
qbinom()
qbinom(P, n, p)
where P is the probability, n is the total number of trials, and p is the probability of success.
rbinom()
rbinom(n, N, p)
where n is the number of observations, N is the total number of trials, and p is the probability
of success.
dbinom() Function
This function is used to find the probability at a particular value for data that follows a
binomial distribution, i.e. it finds:
P(X = k)
Syntax:
dbinom(k, n, p)
Example:
dbinom(3, size = 13, prob = 1 / 6)
probabilities <- dbinom(x = c(0:10), size = 10, prob = 1 / 6)
data.frame(probabilities)
plot(0:10, probabilities, type = "l")
Output :
> dbinom(3, size = 13, prob = 1/6)
[1] 0.2138454
> probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6)
> data.frame(probabilities)
probabilities
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
5 5.426588e-02
6 1.302381e-02
7 2.170635e-03
8 2.480726e-04
9 1.860544e-05
10 8.269086e-07
11 1.653817e-08
The above piece of code first finds the probability at k = 3, then it displays a data frame
containing the probability distribution for k from 0 to 10, which in this case is 0 to n. The row
labels 1 to 11 in the output correspond to k = 0 to 10.
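The value returned by dbinom() can be reproduced from the binomial probability formula using
choose(). The sketch below does this for k = 3, n = 13 and p = 1/6, and also checks that the
probabilities over all possible k add up to 1. The variable names are illustrative.

```r
n <- 13
p <- 1 / 6
k <- 3

# Binomial probability written out: C(n, k) * p^k * (1 - p)^(n - k)
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in equivalent
builtin <- dbinom(k, size = n, prob = p)

# Probabilities over every possible number of successes sum to 1
total <- sum(dbinom(0:n, size = n, prob = p))

print(manual)
print(builtin)
print(total)
```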
pbinom() Function
The function pbinom() is used to find the cumulative probability of data following a binomial
distribution up to a given value, i.e. it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
Example:
pbinom(3, size = 13, prob = 1 / 6)
plot(0:10, pbinom(0:10, size = 10, prob = 1 / 6), type = "l")
Output :
> pbinom(3, size = 13, prob = 1/6)
[1] 0.8419226
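The cumulative probability is simply the sum of the individual probabilities up to k, which the
following sketch verifies. It also shows the lower.tail argument for the complementary
probability P(X > k). The variable names are illustrative.

```r
# P(X <= 3) from the cumulative function
cumulative <- pbinom(3, size = 13, prob = 1 / 6)

# The same value as a sum of point probabilities P(X = 0) + ... + P(X = 3)
summed <- sum(dbinom(0:3, size = 13, prob = 1 / 6))

# P(X > 3) via the upper tail
more_than_3 <- pbinom(3, size = 13, prob = 1 / 6, lower.tail = FALSE)

print(cumulative)
print(summed)
print(more_than_3)
```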
qbinom() Function
This function is used to find a quantile: given a cumulative probability P, it finds the smallest k for which P(X <= k) is at least P.
Syntax:
qbinom(P, n, p)
Example:
qbinom(0.8419226, size = 13, prob = 1 / 6)
x <- seq(0, 1, by = 0.1)
y <- qbinom(x, size = 13, prob = 1 / 6)
plot(x, y, type = 'l')
Output :
> qbinom(0.8419226, size = 13, prob = 1/6)
[1] 3
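Because the binomial distribution is discrete, qbinom() returns the smallest k whose cumulative
probability reaches the requested value. The sketch below illustrates this for a target
probability of 0.8, an arbitrary value chosen for the example; the variable names are
illustrative.

```r
target <- 0.8
k <- qbinom(target, size = 13, prob = 1 / 6)

# The cumulative probability at k reaches the target ...
reaches_target <- pbinom(k, size = 13, prob = 1 / 6) >= target

# ... while the cumulative probability at k - 1 falls short of it
falls_short <- pbinom(k - 1, size = 13, prob = 1 / 6) < target

print(k)
print(reaches_target)
print(falls_short)
```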
rbinom() Function
This function generates n random values from a binomial distribution with the given number of trials and probability of success.
Syntax:
rbinom(n, N, p)
Example:
rbinom(8, size = 13, prob = 1 / 6)
hist(rbinom(8, size = 13, prob = 1 / 6))
Output:
> rbinom(8, size = 13, prob = 1/6)
[1] 1 1 2 1 4 0 2 3
Poisson distribution in R
The Poisson distribution is a discrete distribution that counts the number of events in a Poisson
process. In this tutorial we will review the dpois, ppois, qpois and rpois functions to work with
the Poisson distribution in R.
dpois, ppois, qpois, and rpois in R
Here are some examples of cases where you might use each of these functions.
dpois
The dpois function finds the probability that a certain number of successes
occur based on an average rate of success, using the following syntax:
dpois(x, lambda)
where:
x: number of successes
lambda: average rate of success
Here’s an example of when you might use this function in practice:
It is known that a certain website makes 10 sales per hour. In a given hour,
what is the probability that the site makes exactly 8 sales?
dpois(x=8, lambda=10)
#0.112599
The probability that the site makes exactly 8 sales is 0.112599.
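The same number can be obtained from the Poisson probability formula, exp(-lambda) * lambda^x / x!.
This sketch compares the formula with dpois() for the example above; the variable names are
illustrative.

```r
lambda <- 10
x <- 8

# Poisson probability written out explicitly
manual <- exp(-lambda) * lambda^x / factorial(x)

# Built-in equivalent
builtin <- dpois(x, lambda = lambda)

print(manual)
print(builtin)
```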
ppois
The ppois function finds the probability that a certain number of successes or
less occur based on an average rate of success, using the following syntax:
ppois(q, lambda)
where:
q: number of successes
lambda: average rate of success
Here are a couple of examples of when you might use this function in practice:
It is known that a certain website makes 10 sales per hour. In a given hour,
what is the probability that the site makes 8 sales or less?
ppois(q=8, lambda=10)
#0.3328197
The probability that the site makes 8 sales or less in a given hour is 0.3328197.
It is known that a certain website makes 10 sales per hour. In a given hour,
what is the probability that the site makes more than 8 sales?
1 - ppois(q=8, lambda=10)
#0.6671803
The probability that the site makes more than 8 sales in a given hour is 0.6671803.
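Both results can be cross-checked: the cumulative probability is the sum of the point
probabilities from 0 to 8, and the "more than 8" case can also be requested directly with
lower.tail = FALSE. A minimal sketch with illustrative variable names:

```r
# P(X <= 8) from the cumulative function
at_most_8 <- ppois(8, lambda = 10)

# The same value as a sum of point probabilities P(X = 0) + ... + P(X = 8)
summed <- sum(dpois(0:8, lambda = 10))

# P(X > 8) via the upper tail instead of 1 - ppois(...)
more_than_8 <- ppois(8, lambda = 10, lower.tail = FALSE)

print(at_most_8)
print(summed)
print(more_than_8)
```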
qpois
The qpois function finds the number of successes that corresponds to a certain
percentile based on an average rate of success, using the following syntax:
qpois(p, lambda)
where:
p: percentile
lambda: average rate of success
Here’s an example of when you might use this function in practice:
It is known that a certain website makes 10 sales per hour. How many sales
would the site need to make to be at the 90th percentile for sales in an
hour?
qpois(p=.90, lambda=10)
#14
A site would need to make 14 sales to be at the 90th percentile for number of
sales in an hour.
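Since the number of sales is a whole number, qpois() returns the smallest count whose cumulative
probability is at least the requested percentile. The sketch below checks this for the 90th
percentile; the variable names are illustrative.

```r
k <- qpois(p = 0.90, lambda = 10)

# The cumulative probability at k is at least 0.90 ...
at_k <- ppois(k, lambda = 10)

# ... while one sale fewer is still below 0.90
below_k <- ppois(k - 1, lambda = 10)

print(k)
print(at_k >= 0.90)
print(below_k < 0.90)
```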
rpois
The rpois function generates a list of random variables that follow a Poisson
distribution with a certain average rate of success, using the following syntax:
rpois(n, lambda)
where:
n: number of random variables to generate
lambda: average rate of success
Here’s an example of when you might use this function in practice:
Generate a list of 15 random variables that follow a Poisson distribution
with a rate of success equal to 10.
rpois(n=15, lambda=10)
# [1] 13 8 8 20 8 10 8 10 13 10 12 8 10 10 6
Since these numbers are generated randomly, the rpois() function will produce
different numbers each time. If you want to create a reproducible example, be
sure to use the set.seed() command.
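A reproducible run can be sketched as follows. The seed value 1 is an arbitrary choice; any fixed
number works, as long as the same one is set before each call.

```r
# Fix the seed, then draw 15 Poisson values
set.seed(1)
first_draw <- rpois(n = 15, lambda = 10)

# Resetting the same seed reproduces exactly the same values
set.seed(1)
second_draw <- rpois(n = 15, lambda = 10)

print(first_draw)
print(identical(first_draw, second_draw))
```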
POISSON Distribution in R ▷ [dpois, ppois, qpois and rpois functions] (r-coder.com)