
Unit 4

Uploaded by

gayathrinaik12

Descriptive Analysis in R Programming

Last Updated : 23 May, 2024



Descriptive statistics in the R programming language summarize a data set with representative measures and present it through charts, graphs, and tables so that it can be easily understood.
Descriptive analysis is often performed on small data sets. It summarizes the current findings and can hint at trends, but it does not by itself predict future values. Common measures used to describe a data set are measures of central tendency and measures of variability, or dispersion.
Process of Descriptive Statistics in R
• Measure of central tendency
• Measure of variability

Measure of central tendency


It represents the whole set of data by a single value that locates the center of the data. There are three main measures of central tendency:
• Mean
• Mode
• Median

Measure of variability
The measure of variability describes the spread of the data, that is, how widely the values are distributed. The most common variability measures are:
• Range
• Variance
• Standard deviation
Need of Descriptive Statistics in R
Descriptive analysis helps us understand our data and is an important first step in Machine Learning. Machine Learning focuses on making predictions, while statistics focuses on drawing conclusions from data; understanding the data is a necessary initial step before building any model. Let’s do this descriptive analysis in R.
Descriptive Analysis in R
Descriptive analyses consist of describing simply the data using some summary statistics and graphics. Here, we’ll describe how to
compute summary statistics using R software.
Import your data into R:
Before doing any computation, prepare the data and save it in an external .txt or .csv file. It is good practice to keep the file in the current working directory. Then import the data into R as follows:
R

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Print the first 6 rows
print(head(myData))

Output:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1 TM195 18 Male 14 Single 3 4 29562 112
2 TM195 19 Male 15 Single 2 3 31836 75
3 TM195 19 Female 14 Partnered 4 3 30699 66
4 TM195 19 Male 12 Single 3 3 32973 85
5 TM195 20 Male 13 Partnered 4 2 35247 47
6 TM195 20 Female 14 Partnered 3 3 32973 66
R functions for computing descriptive analysis:
Histogram of Age Distribution
R

library(ggplot2)

# Histogram of Age with 2-year bins, blue bars and red borders
ggplot(myData, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "blue",
                 color = "red", alpha = 0.8) +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

Output:

[Figure: histogram of the Age distribution]

We use the ggplot2 library to create a histogram of the ‘Age’ variable from the ‘myData’ dataset. The histogram bins have a width of 2, and the bars are filled in blue with a red border at 80% opacity. The resulting visualization shows the distribution of ages in the dataset.
Boxplot of Miles by Gender
R

# Boxplot of Miles for each Gender
ggplot(myData, aes(x = Gender, y = Miles, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Miles Distribution by Gender",
       x = "Gender", y = "Miles") +
  theme_minimal()

Output:
[Figure: boxplots of Miles by Gender]

We create a boxplot visualizing the distribution of ‘Miles’ run, segmented by ‘Gender’, from the ‘myData’ dataset. Each box spans the interquartile range (IQR) of Miles for one gender, the line inside the box marks the median, and the whiskers and points show the remaining spread and outliers. The plot is titled “Miles Distribution by Gender,” with ‘Gender’ on the x-axis and ‘Miles’ on the y-axis. The plot is styled with a minimal theme.
Bar Chart of Education Levels
R

# Bar chart counting the observations at each Education level
ggplot(myData, aes(x = factor(Education), fill = factor(Education))) +
  geom_bar() +
  labs(title = "Education Distribution",
       x = "Education Level", y = "Count") +
  theme_minimal()

Output:
[Figure: bar chart of Education levels]

We generate a bar chart illustrating the distribution of ‘Education’ levels from the ‘myData’ dataset. Each bar represents the count of
observations for each education level. The chart is titled “Education Distribution,” with ‘Education Level’ on the x-axis and ‘Count’ on the
y-axis. The visualization adopts a minimal theme for a clean and simple presentation.
Mean
It is the sum of the observations divided by the total number of observations. It is also called the average.

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where n = number of terms


R

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Compute the mean value
# (named mean_age so the result does not mask the mean() function)
mean_age <- mean(myData$Age)
print(mean_age)

Output:
[1] 28.78889
Median
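It is the middle value of the sorted data set and splits the data into two halves. If the number of elements is odd, the center element is the median; if it is even, the median is the average of the two central elements.

$$\text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}$$

where n = number of terms and x_(i) is the i-th value after sorting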


R

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Compute the median value
# (named median_age so the result does not mask the median() function)
median_age <- median(myData$Age)
print(median_age)

Output:
[1] 26
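To make the odd and even cases concrete, the short sketch below applies median() to two made-up vectors. The values are illustrative assumptions and are not taken from CardioGoodFitness.csv.

```r
# Odd number of values: the middle element of the sorted data is the median
odd_values <- c(7, 1, 5)          # sorted: 1 5 7
odd_median <- median(odd_values)

# Even number of values: the median is the average of the two central elements
even_values <- c(7, 1, 5, 3)      # sorted: 1 3 5 7
even_median <- median(even_values)

print(odd_median)   # 5
print(even_median)  # 4
```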
Mode
It is the value that has the highest frequency in the given data set. The data set may have no mode if the frequency of all data points is
the same. Also, we can have more than one mode if we encounter two or more data points having the same frequency.
R

# R program to illustrate
# Descriptive Analysis

# Import the library (it must be installed first)
library(modeest)

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Compute the mode value; mfv() returns the most frequent value(s)
mode_age <- mfv(myData$Age)
print(mode_age)

Output:
[1] 25
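Because a data set can have more than one mode, it can help to tabulate the frequencies directly. The following is a minimal sketch in base R; the vector `ages` is made-up illustration data and `all_modes` is a hypothetical helper name, neither of which comes from the original example.

```r
# Hypothetical sample data (not from the CSV used above)
ages <- c(25, 23, 25, 30, 23, 41)

# Return every value that attains the highest frequency
all_modes <- function(v) {
  freq <- table(v)                              # frequency of each distinct value
  as.numeric(names(freq)[freq == max(freq)])    # keep the most frequent ones
}

modes <- all_modes(ages)
print(modes)   # 23 and 25 both occur twice
```

If every value occurred equally often, this helper would return all of them, which is one way to recognise the "no mode" case.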
Range
The range is the difference between the largest and smallest data points in the data set. The larger the range, the more spread out the data, and vice versa.
Range = largest data value – smallest data value

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Calculate the maximum and the minimum
# (the names avoid masking the max(), min() and range() functions)
max_age <- max(myData$Age)
min_age <- min(myData$Age)

# Calculate the range
age_range <- max_age - min_age
cat("Range is:\n")
print(age_range)

# Alternate method: range() returns the min and max as a vector
r <- range(myData$Age)
print(r)

Output:
Range is:
[1] 32
[1] 18 50
Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring the differences, adding them up, and dividing by the number of data points:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

where,
N = number of terms
μ = mean
This is the population variance. R's var() function computes the sample variance, which divides by N − 1 instead of N, so the output below is the sample variance.
R

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Calculating variance
variance <- var(myData$Age)
print(variance)

Output:
[1] 48.21217
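Note that var() divides by n − 1 (the sample variance) rather than by the number of data points. The sketch below computes both versions by hand on a small made-up vector so the difference is visible; the data and variable names are illustrative assumptions.

```r
# Hypothetical sample data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: divide by n - 1 (this is what var() computes)
sample_var <- sum((x - mean(x))^2) / (n - 1)

# Population variance: divide by n
population_var <- sum((x - mean(x))^2) / n

print(sample_var)       # same value as var(x)
print(var(x))
print(population_var)   # 4
```

For large data sets the two values are close, because dividing by n or by n − 1 makes little difference.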
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each value, squaring the results, adding them up, dividing by the number of terms, and finally taking the square root:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where,
N = number of terms
μ = mean
R's sd() function returns the sample standard deviation, which divides by N − 1 instead of N.
R

# R program to illustrate
# Descriptive Analysis

# Import the data using read.csv()
myData <- read.csv("CardioGoodFitness.csv",
                   stringsAsFactors = FALSE)

# Calculating Standard deviation
std <- sd(myData$Age)
print(std)

Output:
[1] 6.943498
Normal Distribution in R
Last Updated : 13 Apr, 2020



The normal distribution is a continuous probability distribution that describes how data values are spread around a mean. It is one of the most important distributions in statistics because many real-world quantities are approximately normal, for example the heights of a population, shoe sizes, and IQ scores. Data often look approximately normal when each value is the sum of many small, independent random effects. Plotting the value of the variable on the x-axis and the count of each value on the y-axis gives a bell-shaped curve. The peak of the curve is at the mean of the data set; half of the values lie to the left of the mean and the other half to the right, so the distribution is symmetric. In R, there are 4 built-in functions for working with the normal distribution:
• dnorm()

dnorm(x, mean, sd)

• pnorm()

pnorm(x, mean, sd)

• qnorm()

qnorm(p, mean, sd)

• rnorm()

rnorm(n, mean, sd)

where,

– x is a vector of values at which the function is evaluated.
– mean is the mean of the distribution. Its default value is 0.
– sd is the standard deviation of the distribution. Its default value is 1.
– n is the number of observations.
– p is a vector of probabilities.
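As a quick orientation, the sketch below calls each of the four functions once on the standard normal distribution (mean 0, sd 1, the defaults). The specific inputs and the seed are arbitrary choices made for illustration.

```r
# Density at the mean of a standard normal: 1 / sqrt(2 * pi)
d0 <- dnorm(0)

# Cumulative probability P(X <= 1.96) for a standard normal
p <- pnorm(1.96)

# qnorm() inverts pnorm(): this recovers 1.96
q <- qnorm(p)

# Five reproducible random draws with mean 90 and sd 5
set.seed(1)
draws <- rnorm(5, mean = 90, sd = 5)

print(c(d0, p, q))
print(draws)
```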

Functions To Generate Normal Distribution in R

dnorm()
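The dnorm() function in R returns the probability density of the normal distribution at each value of x. In statistics, the density is given by the formula below:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ is the mean and σ is the standard deviation.
Syntax: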

dnorm(x, mean, sd)


Example:

# creating a sequence of values

# between -15 to 15 with a difference of 0.1

x = seq(-15, 15, by=0.1)

y = dnorm(x, mean(x), sd(x))

# output to be present as PNG file

png(file="dnormExample.png")

# Plot the graph.

plot(x, y)

# saving the file

dev.off()
Output:

pnorm()
The pnorm() function is the cumulative distribution function. It gives the probability that a normally distributed random variable X takes a value less than or equal to x; in statistics it is given by:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

Syntax:

pnorm(x, mean, sd)


Example:

# creating a sequence of values

# between -10 to 10 with a difference of 0.1

x <- seq(-10, 10, by=0.1)

y <- pnorm(x, mean = 2.5, sd = 2)

# output to be present as PNG file

png(file="pnormExample.png")
# Plot the graph.

plot(x, y)

# saving the file

dev.off()

Output :

qnorm()
The qnorm() function is the inverse of the pnorm() function. It takes a probability p and returns the value x whose cumulative probability P(X ≤ x) equals p. It is useful for finding the percentiles of a normal distribution.
Syntax:

qnorm(p, mean, sd)


Example:

# Create a sequence of probability values

# incrementing by 0.02.

x <- seq(0, 1, by = 0.02)


y <- qnorm(x, mean(x), sd(x))

# output to be present as PNG file

png(file = "qnormExample.png")

# Plot the graph.

plot(x, y)

# Save the file.

dev.off()

Output:

rnorm()
The rnorm() function in R generates a vector of normally distributed random numbers.
Syntax:

rnorm(n, mean, sd)


Example:
# Create a vector of 10000 random numbers
# with mean=90 and sd=5
x <- rnorm(10000, mean=90, sd=5)

# output to be present as PNG file

png(file = "rnormExample.png")

# Create the histogram with 50 bars

hist(x, breaks=50)

# Save the file.

dev.off()

Output :

Binomial Distribution in R Programming


Last Updated : 10 May, 2020


The binomial distribution is a discrete probability distribution that counts the number of successes in a fixed number of trials, where each trial has only two outcomes: success or failure. The trials are independent, so a previous outcome does not affect the next one, and the probability of success is the same in every trial. The binomial distribution helps us find individual probabilities as well as cumulative probabilities over a certain range.
It is used in many real-life scenarios, such as determining whether a particular lottery ticket has won or not, whether a drug is able to cure a person or not, counting the number of heads or tails in a finite number of coin tosses, or counting how often a given face comes up when rolling a die.
Formula:

$$P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}$$

where n is the total number of trials, p is the probability of success, and k is the number of successes.

Functions for Binomial Distribution


We have four functions for handling the binomial distribution in R, namely:

• dbinom()

dbinom(k, n, p)
• pbinom()

pbinom(k, n, p)
where n is the total number of trials, p is the probability of success, and k is the value at which the probability is evaluated.

• qbinom()

qbinom(P, n, p)
where P is the cumulative probability, n is the total number of trials, and p is the probability of success.

• rbinom()

rbinom(n, N, p)
where n is the number of observations, N is the total number of trials, and p is the probability of success.

dbinom() Function
This function is used to find probability at a particular value for a data that follows binomial distribution i.e. it finds:

P(X = k)
Syntax:
dbinom(k, n, p)
Example:

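# Probability of exactly 3 successes in 13 trials
dbinom(3, size = 13, prob = 1 / 6)

# Probability distribution for k = 0, 1, ..., 10 with n = 10
probabilities <- dbinom(x = c(0:10), size = 10,
                        prob = 1 / 6)
data.frame(probabilities)

plot(0:10, probabilities, type = "l")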

Output :
> dbinom(3, size = 13, prob = 1/6)
[1] 0.2138454
> probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6)
> data.frame(probabilities)
probabilities
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
5 5.426588e-02
6 1.302381e-02
7 2.170635e-03
8 2.480726e-04
9 1.860544e-05
10 8.269086e-07
11 1.653817e-08

The code above first finds the probability at k = 3, then displays a data frame containing the probability distribution for k from 0 to 10, which in this case is 0 to n. Because data frame rows are numbered from 1, row 1 corresponds to k = 0 and row 11 to k = 10.
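The value returned by dbinom() can be checked against the binomial formula directly. The sketch below does this with choose() for the same k = 3, n = 13, p = 1/6; the variable names are illustrative.

```r
k <- 3
n <- 13
p <- 1 / 6

# Binomial formula: choose(n, k) * p^k * (1 - p)^(n - k)
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in probability mass function
built_in <- dbinom(k, size = n, prob = p)

# The probabilities over all possible k (0 to n) add up to 1
total <- sum(dbinom(0:n, size = n, prob = p))

print(c(by_hand, built_in, total))
```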

pbinom() Function
The function pbinom() finds the cumulative probability of data following a binomial distribution up to a given value, i.e. it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
Example:

pbinom(3, size = 13, prob = 1 / 6)

plot(0:10, pbinom(0:10, size = 10, prob = 1 / 6),


type = "l")

Output :
> pbinom(3, size = 13, prob = 1/6)
[1] 0.8419226
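The cumulative probability is simply the sum of the point probabilities up to k. The sketch below confirms this for k = 3 and also shows the upper tail using the `lower.tail` argument of pbinom(); the variable names are illustrative.

```r
n <- 13
p <- 1 / 6

# Cumulative probability as a sum of point probabilities P(X = 0..3)
cdf_by_sum <- sum(dbinom(0:3, size = n, prob = p))
cdf_built_in <- pbinom(3, size = n, prob = p)

# Upper tail P(X > 3), the complement of P(X <= 3)
upper_tail <- pbinom(3, size = n, prob = p, lower.tail = FALSE)

print(c(cdf_by_sum, cdf_built_in, upper_tail))
```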

qbinom() Function
This function finds a quantile of the binomial distribution: given a cumulative probability P, it returns the smallest k such that P(X <= k) >= P.
Syntax:
qbinom(P, n, p)
Example:

qbinom(0.8419226, size = 13, prob = 1 / 6)

x <- seq(0, 1, by = 0.1)


y <- qbinom(x, size = 13, prob = 1 / 6)

plot(x, y, type = 'l')

Output :
> qbinom(0.8419226, size = 13, prob = 1/6)
[1] 3
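Because the binomial distribution is discrete, most probabilities do not correspond exactly to any k. In that case qbinom() returns the smallest k whose cumulative probability reaches the requested value. The sketch below checks this property for P = 0.5, an arbitrary value chosen for illustration.

```r
n <- 13
p <- 1 / 6
P <- 0.5

# Smallest k with P(X <= k) >= P
k <- qbinom(P, size = n, prob = p)

# The cumulative probability at k reaches P ...
reaches <- pbinom(k, size = n, prob = p) >= P

# ... while the cumulative probability at k - 1 does not
below <- pbinom(k - 1, size = n, prob = p) < P

print(k)
print(c(reaches, below))
```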

rbinom() Function
This function generates n random values, each being the number of successes in N trials with success probability p.
Syntax:
rbinom(n, N, p)
Example:

rbinom(8, size = 13, prob = 1 / 6)

hist(rbinom(8, size = 13, prob = 1 / 6))

Output:
> rbinom(8, size = 13, prob = 1/6)
[1] 1 1 2 1 4 0 2 3
Poisson Distribution In R
Last Updated : 02 Feb, 2024



Poisson distribution is a probability distribution that expresses the number of events occurring in a fixed interval of time or space, given a
constant average rate. This distribution is particularly useful when dealing with rare events or incidents that happen independently. R
provides powerful tools for statistical analysis, making it an excellent choice for working with probability distributions like Poisson.
Poisson Distribution
The Poisson distribution describes the number of events that occur within a fixed interval of time or space. If λ is the mean number of occurrences per interval, then the probability of having k occurrences within a given interval is:

$$P(X = k) = \frac{e^{-\lambda}\,\lambda^{k}}{k!}$$

• P(X=k) represents the probability of observing k events.
• e is the base of the natural logarithm.
• λ is the average rate of event occurrences in a fixed interval.
• k is the actual number of events observed.
• k! denotes the factorial of k, which is the product of all positive integers up to k.
Use the Poisson distribution when
1. Events occur randomly and independently, so the likelihood of one event occurring does not influence the likelihood of another.
2. The average rate of events within a specific timeframe or space, denoted as λ (lambda), is known and presumed to be constant.
3. λ is then the only parameter needed to determine the probability of a particular number of events taking place.
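Since λ is the only parameter, the probability of any count can be computed from the Poisson formula with nothing more than exp() and factorial(). The sketch below does this for λ = 2 and k = 3 (values chosen for illustration) and compares the result with R's built-in dpois().

```r
lambda <- 2   # average number of events per interval
k <- 3        # number of events whose probability we want

# Poisson formula: exp(-lambda) * lambda^k / k!
by_hand <- exp(-lambda) * lambda^k / factorial(k)

# Built-in probability mass function
built_in <- dpois(k, lambda)

print(c(by_hand, built_in))
```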
As an example, suppose customer inquiries arrive at an average rate of 20 per minute. ppois(30, lambda = 20) gives the probability of thirty or fewer inquiries in a minute. To find the probability of more than thirty inquiries, we subtract that probability from 1, because the probability in the upper tail is complementary to the probability in the lower tail.

R
# Probability of having thirty or fewer inquiries
probability_30_or_less <- ppois(30, lambda = 20)
print(probability_30_or_less)

# Probability of having more than thirty inquiries
probability_more_than_30 <- 1 - probability_30_or_less
print(probability_more_than_30)

Output:
[1] 0.9865253

[1] 0.01347468
• The probability of having thirty or fewer inquiries (P(X≤30)) is approximately 98.65%.
• The probability of having more than thirty inquiries (P(X>30)) is approximately 1.35%.
• This means that, in a minute, there is a high likelihood (98.65%) that the number of customer inquiries will be thirty or fewer, and a low likelihood (1.35%) that it will be more than thirty, based on the given average rate.
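For a discrete distribution, the complement of P(X ≤ 30) is P(X > 30), that is P(X ≥ 31); to include the value 30 itself, the cut-off has to move to 29. The sketch below shows both upper tails using the `lower.tail` argument of ppois(), assuming the same rate of 20; the variable names are illustrative.

```r
lambda <- 20

# P(X > 30), i.e. 31 or more events: complement of P(X <= 30)
p_more_than_30 <- ppois(30, lambda, lower.tail = FALSE)

# P(X >= 30), i.e. 30 or more events: complement of P(X <= 29)
p_30_or_more <- ppois(29, lambda, lower.tail = FALSE)

# The two tails differ by exactly P(X = 30)
difference <- p_30_or_more - p_more_than_30

print(c(p_more_than_30, p_30_or_more, difference))
```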
Characteristics of Poisson distribution
1. Events Occur Independently: Poisson distribution assumes that events occur independently of each other. This means the
occurrence of one event does not affect the occurrence of another.
2. Constant Average Rate: The events happen at a constant average rate over a fixed interval of time or space.
3. Discrete Nature: The distribution is discrete, meaning it deals with whole numbers (0, 1, 2, …) as it represents the count of events.
Poisson Functions in R Programming
R provides several built-in functions for working with the Poisson distribution. The key functions are `dpois()`, `ppois()`, `qpois()`, and `rpois()`, which correspond to the probability mass function (PMF), cumulative distribution function (CDF), quantile function, and random number generation, respectively.
1. dpois(x, lambda)
• This function calculates the probability mass function (PMF) of the Poisson distribution.
• It gives the probability of observing exactly `x` events in a Poisson distribution with mean `lambda`.

R

x <- 3

lambda <- 2

probability <- dpois(x, lambda)

print(probability)

Output:
[1] 0.180447
2. ppois(q, lambda)
• This function calculates the cumulative distribution function (CDF) of the Poisson distribution.
• It gives the probability of observing `q` or fewer events.

R

q <- 2

lambda <- 3

cumulative_probability <- ppois(q, lambda)

print(cumulative_probability)

Output:
[1] 0.4231901
3. qpois(p, lambda)
• This function calculates the quantile function of the Poisson distribution.
• It returns the smallest integer `q` such that `ppois(q, lambda)` is greater than or equal to `p`.

R

p <- 0.8

lambda <- 4

quantile_value <- qpois(p, lambda)

print(quantile_value)

Output:
[1] 6
4. rpois(n, lambda)
• This function generates random samples from a Poisson distribution.
• It produces `n` random values representing the count of events, where the mean is specified by `lambda`.

R

n <- 10

lambda <- 5

random_samples <- rpois(n, lambda)

print(random_samples)

Output:
[1] 8 3 9 6 4 4 2 4 2 8
These functions are part of the base R package and are helpful for performing various operations related to the Poisson distribution.

R - Linear Regression
Regression analysis is a very widely used statistical tool for establishing a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose value is
derived from the predictor variable.

In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is
not equal to 1, creates a curve.

The general mathematical equation for a linear regression is −


y = ax + b

Following is the description of the parameters used −

• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients: a is the slope and b is the intercept.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known.
To do this we need the relationship between the height and weight of a person.

The steps to create the relationship are −

• Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and create the mathematical equation
using these.
• Get a summary of the relationship model to see the errors in prediction, which are
called residuals.
• To predict the weight of new persons, use the predict() function in R.
Input Data

Below is the sample data representing the observations −


# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax

The basic syntax for lm() function in linear regression is −

lm(formula,data)

Following is the description of the parameters used −

• formula is a symbolic description of the relation between x and y, such as y ~ x.
• data is the data frame containing the variables used in the formula.
Create Relationship Model & get the Coefficients

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746

Get the Summary of the Relationship

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
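The individual numbers in this summary can also be pulled out programmatically. The sketch below refits the same model and uses coef(), residuals() and summary()$r.squared from base R; the variable names slope, intercept and r_squared are only illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Coefficients of the fitted line y = slope * x + intercept
coefs <- coef(relation)
slope <- coefs[["x"]]
intercept <- coefs[["(Intercept)"]]

# Proportion of the variance in y explained by the model
r_squared <- summary(relation)$r.squared

# Residuals: observed weight minus fitted weight for each person
res <- residuals(relation)

print(c(slope = slope, intercept = intercept, r_squared = r_squared))
print(round(res, 2))
```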

predict() Function
Syntax

The basic syntax for predict() in linear regression is −


predict(object, newdata)

Following is the description of the parameters used −

• object is the model which is already created using the lm() function.
• newdata is a data frame containing the new values for the predictor variable.
Predict the weight of new persons

# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

When we execute the above code, it produces the following result −


1
76.22869
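The predicted value is just the fitted equation y = ax + b evaluated at x = 170. The sketch below reproduces it by hand from the coefficients and, as an optional extra, asks predict() for a prediction interval through its `interval` argument; the variable names are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

height <- 170

# Evaluate intercept + slope * height by hand
by_hand <- coef(relation)[["(Intercept)"]] + coef(relation)[["x"]] * height

# Same prediction through predict()
via_predict <- predict(relation, data.frame(x = height))

# Prediction with a 95% prediction interval (columns fit, lwr, upr)
with_interval <- predict(relation, data.frame(x = height),
                         interval = "prediction")

print(by_hand)
print(with_interval)
```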

Visualize the Regression Graphically

# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height in cm
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight in kg
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart with the predictor (height) on the x-axis.
plot(x,y,col = "blue",main = "Height & Weight Regression",
cex = 1.3,pch = 16,xlab = "Height in cm",ylab = "Weight in Kg")

# Add the fitted regression line.
abline(relation)

# Save the file.
dev.off()

When we execute the above code, it produces the following result −



R - Mean, Median and Mode



Statistical analysis in R is performed by using many in-built
functions. Most of these functions are part of the R base package.
These functions take an R vector as input, along with other
arguments, and give the result.

The functions we are discussing in this chapter are mean, median
and mode.

Mean
It is calculated by taking the sum of the values and dividing by
the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax

The basic syntax for calculating mean in R is −


mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

• x is the input vector.
• trim is the fraction (0 to 0.5) of observations dropped from each
end of the sorted vector.
• na.rm is used to remove the missing values from the input
vector.
Example

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

When we execute the above code, it produces the following result



[1] 8.22

Applying Trim Option


When the trim parameter is supplied, the values in the vector are
sorted and then the required number of observations is dropped
from each end before calculating the mean.

When trim = 0.3, the lowest 30% and the highest 30% of the values
are dropped. For this vector of 10 values, that is 3 values from
each end.

In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18,
54) and the values removed from the vector for calculating mean
are (−21,−5,2) from left and (12,18,54) from right.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)

When we execute the above code, it produces the following result



[1] 5.55
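The trimmed mean can be reproduced by hand to see exactly which values are dropped. The sketch below sorts the same vector, removes floor(n * trim) values from each end, and averages the rest; the variable names are illustrative.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
trim <- 0.3

sorted <- sort(x)
n <- length(sorted)

# Number of values dropped from each end: 3 for this vector
drop_n <- floor(n * trim)

# Keep only the middle values: 3, 4.2, 7, 8
kept <- sorted[(drop_n + 1):(n - drop_n)]

by_hand <- mean(kept)
print(kept)
print(by_hand)   # same value as mean(x, trim = 0.3)
```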

Applying NA Option
If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation use na.rm = TRUE,
which means remove the NA values.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)

# Find mean.
result.mean <- mean(x)
print(result.mean)

# Find mean dropping NA values.


result.mean <- mean(x,na.rm = TRUE)
print(result.mean)

When we execute the above code, it produces the following result



[1] NA
[1] 8.22

Median
The middle most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax

The basic syntax for calculating median in R is −


median(x, na.rm = FALSE)

Following is the description of the parameters used −


 x is the input vector.
 na.rm is used to remove the missing values from the input
vector.
Example

# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.


median.result <- median(x)
print(median.result)

When we execute the above code, it produces the following result



[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in
a set of data. Unlike mean and median, the mode can be computed for
both numeric and character data.

R does not have a standard in-built function to calculate the
statistical mode (the built-in mode() reports the storage type of an
object instead). So we create a user function to calculate the mode
of a data set in R. This function takes the vector as input and
gives the mode value as output.
Example

# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.


v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.


result <- getmode(v)
print(result)

# Create the vector with characters.


charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.


result <- getmode(charv)
print(result)

When we execute the above code, it produces the following result



[1] 2
[1] "it"

R - Binomial Distribution

The binomial distribution model deals with finding the probability
of success of an event which has only two possible outcomes in a
series of experiments. For example, tossing a coin always gives
a head or a tail. The probability of getting exactly 3 heads when
tossing a coin 10 times is computed with the binomial distribution.

R has four in-built functions for the binomial distribution.
They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

Following is the description of the parameters used −

• x is a vector of numbers (the numbers of successes).
• p is a vector of probabilities.
• n is number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.

dbinom()
This function gives the probability of each value, i.e. the
probability mass function at each point.

# Create a sequence of the integers from 0 to 50.
x <- seq(0,50,by = 1)

# Create the binomial distribution.


y <- dbinom(x,50,0.5)

# Give the chart file a name.


png(file = "dbinom.png")

# Plot the graph for this sample.


plot(x,y)

# Save the file.


dev.off()

When we execute the above code, it produces the following result


pbinom()
This function gives the cumulative probability of an event. It is a
single value representing the probability.

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)

print(x)

When we execute the above code, it produces the following result



[1] 0.610116

qbinom()
This function takes a probability value and gives the smallest
number whose cumulative probability is at least that value.

# Number of heads whose cumulative probability reaches 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)

print(x)

When we execute the above code, it produces the following result



[1] 23

rbinom()
This function generates the required number of random values for a
given number of trials and probability of success.

# Generate 8 random values, each the number of successes in 150 trials
# with a success probability of 0.4.
x <- rbinom(8,150,.4)

print(x)
When we execute the above code, it produces the following result

[1] 58 61 59 66 55 60 61 67

R - Poisson Regression

Poisson Regression involves regression models in which the
response variable is in the form of counts and not fractional
numbers. For example, the number of births or the number of wins
in a football match series. The values of the response variable
are also assumed to follow a Poisson distribution.

The general mathematical equation for Poisson regression is −


log(y) = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −

• y is the response variable.
• a and b are the numeric coefficients.
• x is the predictor variable.

The function used to create the Poisson regression model is
the glm() function.
Syntax

The basic syntax for glm() function in Poisson regression is −


glm(formula,data,family)

Following is the description of the parameters used in the above
function −

• formula is the symbolic description of the relationship between
the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its
value is poisson for Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the
effect of wool type (A or B) and tension (low, medium or high) on
the number of warp breaks per loom. Let's consider "breaks" as
the response variable, which is a count of the number of breaks. The
"wool" type and "tension" are taken as predictor variables.

Input Data

input <- warpbreaks
print(head(input))

When we execute the above code, it produces the following result



breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L

Create Regression Model


output <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
              family = poisson)
print(summary(output))

When we execute the above code, it produces the following result



Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 297.37 on 53 degrees of freedom


Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4


In the summary we look for the p-value in the last column to be
less than 0.05 to consider a predictor variable to have an impact on
the response variable. As seen, wool type B and tension types M and
H all have an impact on the count of breaks; their negative
estimates mean fewer breaks than the baseline of wool type A with
tension L.
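Because the model is fitted on the log scale, the coefficients are easier to read after exponentiating them, which turns each one into a multiplicative effect on the expected count. The sketch below refits the same model and does this with exp(coef()); it also predicts the expected breaks for one combination and computes the ratio of residual deviance to degrees of freedom, which is commonly used as a rough check for overdispersion. The names rate_ratios, new_loom and dispersion_ratio are illustrative.

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Multiplicative effect of each term on the expected number of breaks
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))

# Expected breaks for wool B under high tension
new_loom <- data.frame(
  wool = factor("B", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)
expected <- predict(output, newdata = new_loom, type = "response")
print(expected)

# Residual deviance per degree of freedom; values well above 1
# suggest more variation than a Poisson model assumes
dispersion_ratio <- deviance(output) / df.residual(output)
print(dispersion_ratio)
```

A rate ratio below 1 corresponds to a negative coefficient, so wool B and the higher tensions are each associated with fewer expected breaks than the baseline.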
