Unit 4

Uploaded by

gayathrinaik12

Descriptive Analysis in R Programming

Last Updated : 23 May, 2024





Descriptive statistics in the R programming language describe a data set through summary measures and through charts, graphs, and tables, so that the data are presented in a meaningful way and can be easily understood.
Descriptive analysis is often performed on small data sets, and its findings can inform expectations about future trends. The measures most commonly used to describe a data set are measures of central tendency and measures of variability (dispersion).
Process of Descriptive Statistics in R
 The measure of central tendency
 Measure of variability

Measure of central tendency

It represents the whole set of data by a single value. It gives us the location of central points. There are three main measures of central
tendency:
 Mean
 Mode
 Median

Measure of variability
In descriptive statistics, a measure of variability describes the spread of the data, that is, how widely the values are distributed around the center. The most
common variability measures are:
 Range
 Variance
 Standard deviation
Need of Descriptive Statistics in R
Descriptive analysis helps us understand our data and is an important first step in Machine Learning. Machine Learning is mostly about making predictions, while statistics is about drawing conclusions from data, and those conclusions are needed before a predictive model can be built. Let’s do this descriptive analysis in R.
Descriptive Analysis in R
Descriptive analyses consist of describing simply the data using some summary statistics and graphics. Here, we’ll describe how to
compute summary statistics using R software.
Import your data into R:
Before doing any computation, we need to prepare the data and save it in an external .txt or .csv file. It is good practice to save the file in the current working directory. After that, import the data into R as follows:
R

# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)

# Print the first 6 rows

print(head(myData))

Output:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1 TM195 18 Male 14 Single 3 4 29562 112
2 TM195 19 Male 15 Single 2 3 31836 75
3 TM195 19 Female 14 Partnered 4 3 30699 66
4 TM195 19 Male 12 Single 3 3 32973 85
5 TM195 20 Male 13 Partnered 4 2 35247 47
6 TM195 20 Female 14 Partnered 3 3 32973 66
R functions for computing descriptive analysis:
Histogram of Age Distribution
R

library(ggplot2)

ggplot(myData, aes(x = Age)) +

geom_histogram(binwidth = 2, fill = "blue",

color = "red", alpha = 0.8) +

labs(title = "Age Distribution", x = "Age", y

= "Frequency")

Output:

[Figure: histogram of Age, titled “Age Distribution”]

This code uses the ggplot2 library to create a histogram of the ‘Age’ variable from the ‘myData’ dataset. The histogram bins have a width of 2, and the
bars are filled in blue with a red border at 80% opacity. The resulting visualization shows the distribution of ages in the dataset.
Boxplot of Miles by Gender
R

ggplot(myData, aes(x = Gender, y = Miles, fill

= Gender)) +

geom_boxplot() +

labs(title = "Miles Distribution by Gender",

x = "Gender", y = "Miles") +

theme_minimal()

Output:
[Figure: boxplots of Miles for each Gender, titled “Miles Distribution by Gender”]

We create a boxplot visualizing the distribution of ‘Miles’ run, segmented by ‘Gender’ from the ‘myData’ dataset. Each box spans
the interquartile range (IQR) of Miles for that gender, and the line inside the box marks the median. The plot is titled “Miles Distribution by Gender,” with ‘Gender’ on the x-axis and
‘Miles’ on the y-axis. The plot is styled with a minimal theme.
Bar Chart of Education Levels
R

ggplot(myData, aes(x = factor(Education), fill

= factor(Education))) +

geom_bar() +

labs(title = "Education Distribution", x =

"Education Level", y = "Count") +

theme_minimal()

Output:
[Figure: bar chart of counts per Education level, titled “Education Distribution”]

We generate a bar chart illustrating the distribution of ‘Education’ levels from the ‘myData’ dataset. Each bar represents the count of
observations for each education level. The chart is titled “Education Distribution,” with ‘Education Level’ on the x-axis and ‘Count’ on the
y-axis. The visualization adopts a minimal theme for a clean and simple presentation.
Mean
It is the sum of the observations divided by the total number of observations. It is commonly called the average.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where n = number of terms
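
As a quick check of the definition, the mean can be computed by hand and compared with R's built-in mean(). This is a minimal sketch; the ages vector is made up for illustration and is not taken from CardioGoodFitness.csv.

```r
# Hypothetical ages, used only for illustration
ages <- c(18, 19, 19, 20, 24)

# Sum of the observations divided by the number of observations
manual_mean <- sum(ages) / length(ages)

# Built-in equivalent
builtin_mean <- mean(ages)

print(manual_mean)   # 20
print(builtin_mean)  # 20
```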

# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)

# Compute the mean value

mean = mean(myData$Age)

print(mean)

Output:
[1] 28.78889
Median
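It is the middle value of the sorted data set. It splits the data into two halves. If the number of elements in the data set is odd, the center
element is the median; if it is even, the median is the average of the two central elements.

$$\text{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$

where n = number of terms and x_k is the k-th value in sorted order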
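
The odd and even cases can be sketched directly and compared with median(). The helper name manual_median and both small vectors are assumptions made for this illustration.

```r
# Hypothetical helper: median from the odd/even rule described above
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd count: the single center element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even count: average of the two central elements
  }
}

odd_result  <- manual_median(c(7, 1, 5))     # sorted: 1 5 7   -> 5
even_result <- manual_median(c(7, 1, 5, 3))  # sorted: 1 3 5 7 -> 4

print(odd_result)
print(even_result)
```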

# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)

# Compute the median value

median = median(myData$Age)

print(median)

Output:
[1] 26
Mode
It is the value that has the highest frequency in the given data set. The data set may have no mode if the frequency of all data points is
the same. Also, we can have more than one mode if we encounter two or more data points having the same frequency.
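
The cases of several modes and of no mode can be handled in base R with table(). This is a sketch rather than a standard function: find_modes is a hypothetical helper, and the input vectors are made up for illustration.

```r
# Hypothetical helper: return every value that shares the highest frequency,
# or an empty vector when all distinct values are equally frequent (no mode)
find_modes <- function(v) {
  counts <- table(v)
  if (length(counts) > 1 && length(unique(as.vector(counts))) == 1) {
    return(v[0])
  }
  top <- names(counts)[counts == max(counts)]
  # table() keeps values as character labels, so map them back to the input
  unique(v[as.character(v) %in% top])
}

two_modes <- find_modes(c(25, 23, 25, 24, 23))  # 25 and 23 both occur twice
no_mode   <- find_modes(c(1, 2, 3))             # every value occurs once

print(two_modes)  # 25 23
print(no_mode)    # numeric(0)
```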
R

# R program to illustrate

# Descriptive Analysis

# Import the library

library(modeest)

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)
# Compute the mode value

mode = mfv(myData$Age)

print(mode)

Output:
[1] 25
Range
The range describes the difference between the largest and smallest data point in our data set. The bigger the range, the more is the
spread of data and vice versa.
Range = Largest data value – smallest data value

# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)

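# Calculate the maximum

max_age = max(myData$Age)

# Calculate the minimum

min_age = min(myData$Age)

# Calculate the range as a single number
# (named age_range so it does not shadow the range() function)

age_range = max_age - min_age

cat("Range is:\n")

print(age_range)

# Alternate method: range() returns the minimum and the maximum

r = range(myData$Age)

print(r)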

Output:
Range is:
[1] 32
[1] 18 50
Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point
and the mean, squaring the differences, adding them up, and then dividing by the number of data points
in the data set.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

where,
N = number of terms
μ = Mean
Note that R's var() computes the sample variance, which divides by N - 1 instead of N.
R

# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors = F)

# Calculating variance

variance = var(myData$Age)

print(variance)

Output:
[1] 48.21217
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each value
and squaring the result, adding all the squared values, dividing by the number of terms, and finally taking the square root.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$

where,
N = number of terms
μ = Mean
Note that R's sd() is the square root of the sample variance, so it divides by N - 1 instead of N.
R

# R program to illustrate

# Descriptive Analysis
# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors = F)

# Calculating Standard deviation

std = sd(myData$Age)

print(std)

Output:
[1] 6.943498
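
The definitions above divide by the number of terms N, whereas R's var() and sd() divide by N - 1 (the sample versions). The sketch below shows both on a small made-up vector; the variable names are illustrative only.

```r
# Hypothetical data, used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

pop_var  <- ss / n         # population variance (divide by N)
samp_var <- ss / (n - 1)   # sample variance (divide by N - 1), what var() returns

pop_sd  <- sqrt(pop_var)
samp_sd <- sqrt(samp_var)  # what sd() returns

print(c(pop_var, samp_var, var(x)))
print(c(pop_sd, samp_sd, sd(x)))
```

For large N the two versions are nearly identical; the difference matters mostly for small samples.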
Normal Distribution in R
Last Updated : 13 Apr, 2020





The normal distribution is a probability distribution that describes how data values are spread around their mean. It is one of the most
important distributions in statistics because many real-world quantities are approximately normal, for example the height of
a population, shoe size, and IQ level. (The outcome of a single die roll, by contrast, is uniform rather than normal.) Data tend to look normal when each value arises as the sum of many small
independent effects. Plotting the value of the variable on the x-axis and the count of
the value on the y-axis gives a bell-shaped curve. The peak of the curve is at the mean of the data set, and the curve is symmetric: half of the values
lie to the left of the mean and the other half lie to the right. In R, there are 4 built-in functions for working with the normal distribution:
 dnorm()

dnorm(x, mean, sd)

 pnorm()

pnorm(x, mean, sd)

 qnorm()

qnorm(p, mean, sd)

 rnorm()

rnorm(n, mean, sd)

where,

– x is a vector of values (quantiles)

– mean is the mean of the distribution. Its default value is 0.

– sd is the standard deviation of the distribution. Its default value is 1.

– n is the number of observations

– p is a vector of probabilities
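
A short numerical sketch of how dnorm(), pnorm(), and qnorm() relate to each other, using the standard normal distribution (mean 0, sd 1). The variable names are illustrative; the only outside fact used is that the normal density at its mean equals 1 / (sd * sqrt(2 * pi)).

```r
mu <- 0
sigma <- 1

# Density at the mean: dnorm() should match 1 / (sigma * sqrt(2 * pi))
manual_density  <- 1 / (sigma * sqrt(2 * pi))
density_at_mean <- dnorm(mu, mean = mu, sd = sigma)

# Cumulative probability below 1.96 (about 0.975)
p <- pnorm(1.96, mean = mu, sd = sigma)

# qnorm() inverts pnorm(): it recovers 1.96 from p
q <- qnorm(p, mean = mu, sd = sigma)

# Share of values within one standard deviation of the mean (about 0.683)
within_one_sd <- pnorm(1) - pnorm(-1)

print(c(manual_density, density_at_mean))
print(c(p, q, within_one_sd))
```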

Functions To Generate Normal Distribution in R

dnorm()
The dnorm() function in R returns the value of the probability density function of the normal distribution. It is given by the formula

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ is the mean and σ is the standard deviation. Syntax:

dnorm(x, mean, sd)

Example:

# creating a sequence of values

# between -15 to 15 with a difference of 0.1

x = seq(-15, 15, by=0.1)

y = dnorm(x, mean(x), sd(x))

# output to be present as PNG file

png(file="dnormExample.png")

# Plot the graph.

plot(x, y)

# saving the file

dev.off()
Output:

pnorm()
The pnorm() function is the cumulative distribution function. It gives the probability that a normally distributed random variable X takes a value less
than or equal to x:

$$F(x) = P(X \le x)$$

Syntax:

pnorm(x, mean, sd)

Example:

# creating a sequence of values

# between -10 to 10 with a difference of 0.1

x <- seq(-10, 10, by=0.1)

y <- pnorm(x, mean = 2.5, sd = 2)

# output to be present as PNG file

png(file="pnormExample.png")
# Plot the graph.

plot(x, y)

# saving the file

dev.off()

Output :

qnorm()
The qnorm() function is the inverse of the pnorm() function. It takes a probability p and returns the value x whose cumulative
probability is p. It is useful for finding the percentiles of a normal distribution. Syntax:

qnorm(p, mean, sd)

Example:

# Create a sequence of probability values

# incrementing by 0.02.

x <- seq(0, 1, by = 0.02)

y <- qnorm(x, mean(x), sd(x))

# output to be present as PNG file

png(file = "qnormExample.png")

# Plot the graph.

plot(x, y)

# Save the file.

dev.off()

Output:

rnorm()
The rnorm() function in R generates a vector of normally distributed random numbers. Syntax:

rnorm(n, mean, sd)

Example:
# Create a vector of 10000 random numbers

# with mean=90 and sd=5

x <- rnorm(10000, mean=90, sd=5)

# output to be present as PNG file

png(file = "rnormExample.png")

# Create the histogram with 50 bars

hist(x, breaks=50)

# Save the file.

dev.off()

Output :

Binomial Distribution in R Programming

Last Updated : 10 May, 2020




The binomial distribution is a discrete probability distribution for the number of successes in a fixed number of trials, where each trial has only
two outcomes, success or failure. The trials are independent, so a previous outcome does not affect the next one, and the probability of success
is the same in every trial. The binomial distribution helps us find individual
probabilities as well as cumulative probabilities over a certain range.
It is used in many real-life scenarios, such as determining whether a particular lottery ticket has won or not, whether a drug is able
to cure a person or not, the number of heads or tails in a finite number of tosses, or the number of times a given face
appears when a die is rolled repeatedly.
Formula:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

where n is the total number of trials, p is the probability of success, and k is the number of successes.

Functions for Binomial Distribution

We have four functions for handling binomial distribution in R namely:

 dbinom()

dbinom(k, n, p)
 pbinom()

pbinom(k, n, p)
where n is total number of trials, p is probability of success, k is the value at which the probability has to be found out.

 qbinom()

qbinom(P, n, p)
Where P is the probability, n is the total number of trials and p is the probability of success.

 rbinom()

rbinom(n, N, p)
Where n is numbers of observations, N is the total number of trials, p is the probability of success.

dbinom() Function
This function is used to find probability at a particular value for a data that follows binomial distribution i.e. it finds:

P(X = k)
Syntax:
dbinom(k, n, p)
Example:

dbinom(3, size = 13, prob = 1 / 6)

probabilities <- dbinom(x = c(0:10), size = 10,

prob = 1 / 6)

data.frame(probabilities)

plot(0:10, probabilities, type = "l")

Output :
> dbinom(3, size = 13, prob = 1/6)
[1] 0.2138454
> probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6)
> data.frame(probabilities)
probabilities
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
5 5.426588e-02
6 1.302381e-02
7 2.170635e-03
8 2.480726e-04
9 1.860544e-05
10 8.269086e-07
11 1.653817e-08

The above piece of code first finds the probability at k=3, then it displays a data frame containing the probability distribution for k from 0
to 10 which in this case is 0 to n.
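
The value returned by dbinom() can be reproduced from the binomial formula with choose(). A minimal sketch using the same numbers as the example above (k = 3, n = 13, p = 1/6); the variable names are illustrative.

```r
k <- 3
n <- 13
p <- 1 / 6

# C(n, k) * p^k * (1 - p)^(n - k)
manual_prob  <- choose(n, k) * p^k * (1 - p)^(n - k)
builtin_prob <- dbinom(k, size = n, prob = p)

print(manual_prob)
print(builtin_prob)

# The probabilities over all possible k (0 to n) add up to 1
total_prob <- sum(dbinom(0:n, size = n, prob = p))
print(total_prob)
```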

pbinom() Function
The function pbinom() is used to find the cumulative probability of data following a binomial distribution up to a given value, i.e. it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
Example:

pbinom(3, size = 13, prob = 1 / 6)

plot(0:10, pbinom(0:10, size = 10, prob = 1 / 6),

type = "l")

Output :
> pbinom(3, size = 13, prob = 1/6)
[1] 0.8419226

qbinom() Function
This function is used to find a quantile: given a cumulative probability P, it returns the smallest k such that P(X <= k) >= P.
Syntax:
qbinom(P, n, p)
Example:

qbinom(0.8419226, size = 13, prob = 1 / 6)

x <- seq(0, 1, by = 0.1)

y <- qbinom(x, size = 13, prob = 1 / 6)

plot(x, y, type = 'l')

Output :
> qbinom(0.8419226, size = 13, prob = 1/6)
[1] 3
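
Because the binomial distribution is discrete, qbinom() returns the smallest k whose cumulative probability reaches the requested value. The sketch below checks that property for an illustrative probability of 0.8 (an assumed value, chosen to avoid rounding issues) with the same size and prob as above.

```r
n <- 13
p <- 1 / 6
P <- 0.8   # illustrative cumulative probability

k <- qbinom(P, size = n, prob = p)

# The cumulative probability at k reaches P ...
reaches_P <- pbinom(k, size = n, prob = p) >= P
# ... while the cumulative probability at k - 1 is still below P
below_P <- pbinom(k - 1, size = n, prob = p) < P

print(k)
print(c(reaches_P, below_P))
```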

rbinom() Function
This function generates n random values, each being the number of successes in N trials with probability of success p.
Syntax:
rbinom(n, N, p)
Example:

rbinom(8, size = 13, prob = 1 / 6)

hist(rbinom(8, size = 13, prob = 1 / 6))

Output:
> rbinom(8, size = 13, prob = 1/6)
[1] 1 1 2 1 4 0 2 3
Poisson Distribution In R
Last Updated : 02 Feb, 2024





Poisson distribution is a probability distribution that expresses the number of events occurring in a fixed interval of time or space, given a
constant average rate. This distribution is particularly useful when dealing with rare events or incidents that happen independently. R
provides powerful tools for statistical analysis, making it an excellent choice for working with probability distributions like Poisson.
Poisson Distribution
The Poisson distribution describes the number of events that occur within a fixed interval of time or space. If λ
is the mean number of occurrences per interval, then the probability of having k occurrences within a given interval is:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

 P(X=k) represents the probability of observing k events.

 e is the base of the natural logarithm.
 λ is the average rate of event occurrences in a fixed interval.
 k is the actual number of events observed.
 k! denotes the factorial of k, which is the product of all positive integers up to k.
Use the Poisson distribution when
1. Events occur randomly and independently, so that the likelihood of one event occurring does not influence the likelihood of another.
2. The average rate of events within a specific timeframe or space, denoted as λ (lambda), is known and presumed to be constant.
3. When events follow a Poisson distribution, λ is the only parameter needed to determine the probability of a
particular number of events taking place.
As an example, suppose customer inquiries arrive at an average rate of 20 per minute (λ = 20). ppois(30, lambda = 20) gives the probability of thirty or fewer inquiries in a minute. To get the probability of more than thirty inquiries, we subtract that probability from 1, because the probability in
the upper tail is complementary to the probability in the lower tail.

 R
# Probability of having thirty or fewer inquiries

probability_30_or_less <- ppois(30, lambda = 20)

print(probability_30_or_less)

# Probability of having more than thirty inquiries

probability_more_than_30 <- 1 - probability_30_or_less

print(probability_more_than_30)

Output:
[1] 0.9865253

[1] 0.01347468
 The probability of having thirty or fewer inquiries (P(X≤30)) is approximately 98.65%.
 The probability of having more than thirty inquiries (P(X>30)) is approximately 1.35%.
 This means that, in a minute, there is a high likelihood (98.65%) that the number of customer inquiries will be thirty or fewer, and a
low likelihood (1.35%) that it will be more than thirty, based on the given average rate.
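
Because the Poisson distribution is discrete, "thirty or more" and "more than thirty" are different events: 1 - P(X ≤ 30) equals P(X > 30), whereas P(X ≥ 30) equals 1 - P(X ≤ 29). A minimal sketch of both upper-tail probabilities, using the lower.tail argument of ppois() and the same rate of 20 as above; the variable names are illustrative.

```r
lambda <- 20

# P(X > 30): upper tail strictly above 30
p_more_than_30 <- ppois(30, lambda = lambda, lower.tail = FALSE)

# P(X >= 30): upper tail starting at 30, i.e. strictly above 29
p_30_or_more <- ppois(29, lambda = lambda, lower.tail = FALSE)

print(p_more_than_30)
print(p_30_or_more)

# The two differ by exactly P(X = 30)
print(p_30_or_more - p_more_than_30)
print(dpois(30, lambda = lambda))
```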
Characteristics of Poisson distribution
1. Events Occur Independently: Poisson distribution assumes that events occur independently of each other. This means the
occurrence of one event does not affect the occurrence of another.
2. Constant Average Rate: The events happen at a constant average rate over a fixed interval of time or space.
3. Discrete Nature: The distribution is discrete, meaning it deals with whole numbers (0, 1, 2, …) as it represents the count of events.
Poisson Functions in R Programming
R provides several built-in functions to work with the Poisson distribution. The key functions include `dpois()`, `ppois()`, `qpois()`, and `rpois()`,
which correspond to the probability mass function (PMF), cumulative distribution function (CDF), quantile function, and random number
generation, respectively.
1. dpois(x, lambda)
 This function calculates the probability mass function (PMF) of the Poisson distribution.
 It gives the probability of observing exactly `x` events in a Poisson distribution with mean (`lambda`).

 R

x <- 3

lambda <- 2

probability <- dpois(x, lambda)

print(probability)

Output:
[1] 0.180447
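
The same value can be reproduced from the Poisson probability mass function, λ^k e^(-λ) / k!, evaluated by hand. A minimal sketch with the same inputs as above (3 events, lambda = 2); the variable names are illustrative.

```r
k <- 3
lambda <- 2

# lambda^k * e^(-lambda) / k!
manual_probability  <- lambda^k * exp(-lambda) / factorial(k)
builtin_probability <- dpois(k, lambda)

print(manual_probability)
print(builtin_probability)
```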
2. ppois(q, lambda)
 This function calculates the cumulative distribution function (CDF) of the Poisson distribution.
 It gives the probability of observing fewer than or equal to `q` events.

 R

q <- 2

lambda <- 3

cumulative_probability <- ppois(q, lambda)

print(cumulative_probability)

Output:
[1] 0.4231901
3. qpois(p, lambda)
 This function calculates the quantile function of the Poisson distribution.
 It returns the smallest integer `q` such that `ppois(q, lambda)` is greater than or equal to `p`.

 R

p <- 0.8

lambda <- 4

quantile_value <- qpois(p, lambda)

print(quantile_value)

Output:
[1] 6
4. rpois(n, lambda)
 This function generates random samples from a Poisson distribution.
 It produces `n` random values representing the count of events, where the mean is specified by `lambda`.

 R

n <- 10

lambda <- 5

random_samples <- rpois(n, lambda)

print(random_samples)

Output:
[1] 8 3 9 6 4 4 2 4 2 8
These functions are part of the base R package and are helpful for performing various operations related to the Poisson distribution.
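
For a Poisson distribution both the mean and the variance equal λ, which can be checked approximately by simulation. The sketch below draws a large sample with rpois(); the seed and the sample size are arbitrary choices made for illustration, and the results vary slightly with other seeds.

```r
set.seed(42)   # arbitrary seed, only to make the run repeatable

lambda <- 5
samples <- rpois(10000, lambda)

sample_mean <- mean(samples)
sample_var  <- var(samples)

# Both values should be close to lambda = 5
print(sample_mean)
print(sample_var)
```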

R - Linear Regression
Regression analysis is a widely used statistical tool for modelling the relationship
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other is called the response variable, whose value is
estimated from the predictor variable.

In linear regression these two variables are related through an equation in which the exponent
(power) of both variables is 1. Mathematically, a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship, where the exponent of any variable is
not equal to 1, creates a curve.

The general mathematical equation for a linear regression is −

y = ax + b

Following is the description of the parameters used −

 y is the response variable.

 x is the predictor variable.
 a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting weight of a person when his height is known.
To do this we need to have the relationship between height and weight of a person.

The steps to create the relationship are −

 Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
 Create a relationship model using the lm() function in R.
 Find the coefficients from the model created and write the mathematical equation
using them.
 Get a summary of the relationship model to see the error in prediction, which is
given by the residuals.
 To predict the weight of new persons, use the predict() function in R.
Input Data

Below is the sample data representing the observations −

# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax

The basic syntax for lm() function in linear regression is −

lm(formula,data)

Following is the description of the parameters used −

 formula is a symbolic description of the relation between x and y, such as y ~ x.

 data is the data frame containing the variables named in the formula. It can be omitted when the variables are vectors in the workspace.
Create Relationship Model & get the Coefficients

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746
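
The coefficients reported by lm() can be reproduced from the least-squares formulas for a single predictor: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). A minimal sketch on the same height and weight data; the variable names slope and intercept are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Least-squares estimates computed by hand
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates taken from the fitted model
relation <- lm(y ~ x)
model_coefficients <- coef(relation)

print(c(intercept, slope))
print(model_coefficients)
```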

Get the Summary of the Relationship

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom

Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

predict() Function
Syntax

The basic syntax for predict() in linear regression is −

predict(object, newdata)

Following is the description of the parameters used −

 object is the model which is already created using the lm() function.
 newdata is a data frame containing the new values for the predictor variable.
Predict the weight of new persons

# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

When we execute the above code, it produces the following result −

1
76.22869
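
predict() also accepts several new values at once and, for lm models, can return an interval around each prediction through its interval argument. A minimal sketch on the same data; the extra heights 160 and 180 and the name new_heights are assumptions made for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# The column name must match the predictor name used in the formula (x)
new_heights <- data.frame(x = c(160, 170, 180))

# Point predictions with 95% prediction intervals
# (columns: fit = predicted weight, lwr / upr = interval bounds)
predictions <- predict(relation, new_heights,
                       interval = "prediction", level = 0.95)
print(predictions)
```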

Visualize the Regression Graphically

# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.

png(file = "linearregression.png")

# Plot the chart.

plot(y,x,col = "blue",main = "Height & Weight Regression",
cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Add the fitted line. Weight is on the x-axis here, so the line
# comes from regressing height (x) on weight (y).
abline(lm(x~y))

# Save the file.

dev.off()

When we execute the above code, it produces the following result −


R - Mean, Median and Mode


Statistical analysis in R is performed by using many in-built

functions. Most of these functions are part of the R base package.
These functions take R vector as an input along with the
arguments and give the result.

The functions we are discussing in this chapter are mean, median

and mode.

Mean
It is calculated by taking the sum of the values and dividing by
the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax

The basic syntax for calculating mean in R is −

mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

 x is the input vector.

 trim is used to drop some observations from both end of the
sorted vector.
 na.rm is used to remove the missing values from the input
vector.
Example

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

When we execute the above code, it produces the following result

−
[1] 8.22

Applying Trim Option

When trim parameter is supplied, the values in the vector get
sorted and then the required numbers of observations are
dropped from calculating the mean.

When trim = 0.3, 3 values from each end will be dropped from the
calculations to find mean.

In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18,
54) and the values removed from the vector for calculating mean
are (−21,−5,2) from left and (12,18,54) from right.
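
The trimming described above can be reproduced step by step: sort the vector, drop the computed number of values from each end, and average what is left. A minimal sketch on the same vector; the intermediate names are illustrative.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
trim <- 0.3

sorted <- sort(x)
n <- length(sorted)

# Number of observations dropped from each end (3 for this vector)
drop_n <- floor(n * trim)

# Keep only the middle values: 3, 4.2, 7, 8
kept <- sorted[(drop_n + 1):(n - drop_n)]

manual_trimmed <- mean(kept)
print(kept)
print(manual_trimmed)        # 5.55
print(mean(x, trim = 0.3))   # same value from the built-in option
```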

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x,trim = 0.3)
print(result.mean)

When we execute the above code, it produces the following result

−
[1] 5.55

Applying NA Option
If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation use na.rm = TRUE,
which removes the NA values.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)

# Find mean.
result.mean <- mean(x)
print(result.mean)

# Find mean dropping NA values.

result.mean <- mean(x,na.rm = TRUE)
print(result.mean)

When we execute the above code, it produces the following result

−
[1] NA
[1] 8.22

Median
The middle most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax

The basic syntax for calculating median in R is −

median(x, na.rm = FALSE)

Following is the description of the parameters used −

 x is the input vector.
 na.rm is used to remove the missing values from the input
vector.
Example

# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.

median.result <- median(x)
print(median.result)

When we execute the above code, it produces the following result

−
[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in
a set of data. Unlike the mean and median, the mode can be found for both
numeric and character data.

R does not have a standard in-built function to calculate mode. So

we create a user function to calculate mode of a data set in R.
This function takes the vector as input and gives the mode value
as output.
Example

# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.

v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.

result <- getmode(v)
print(result)

# Create the vector with characters.

charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.

result <- getmode(charv)
print(result)

When we execute the above code, it produces the following result

−
[1] 2
[1] "it"
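
In the numeric vector above, the values 2 and 3 both occur four times. The getmode function returns 2 because which.max() picks the first maximum, and 2 appears before 3 in the vector. A short sketch with table() makes the tie visible; the name tied_modes is illustrative.

```r
v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)

# Frequency of each distinct value
counts <- table(v)
print(counts)

# All values that share the highest frequency
tied_modes <- as.numeric(names(counts)[counts == max(counts)])
print(tied_modes)   # 2 3
```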

R - Binomial Distribution

The binomial distribution model deals with finding the probability
of success of an event which has only two possible outcomes in a
series of experiments. For example, tossing a coin always gives
a head or a tail. The probability of getting exactly 3 heads when
tossing a coin 10 times is computed from the
binomial distribution.
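
The coin example can be worked out directly with dbinom(), assuming a fair coin (probability of heads 0.5). The variable names are illustrative.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin:
# choose(10, 3) / 2^10 = 120 / 1024
p_exactly_3 <- dbinom(3, size = 10, prob = 0.5)

# Probability of at most 3 heads: (1 + 10 + 45 + 120) / 1024
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5)

print(p_exactly_3)   # 0.1171875
print(p_at_most_3)   # 0.171875
```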

R has four in-built functions to generate binomial distribution.

They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

Following is the description of the parameters used −

 x is a vector of numbers.
 p is a vector of probabilities.
 n is number of observations.
 size is the number of trials.
 prob is the probability of success of each trial.

dbinom()
This function gives the probability of each possible number of
successes (the probability mass function).

# Create the sequence of integers from 0 to 50, incremented by 1.
x <- seq(0,50,by = 1)

# Create the binomial distribution.

y <- dbinom(x,50,0.5)

# Give the chart file a name.

png(file = "dbinom.png")

# Plot the graph for this sample.

plot(x,y)

# Save the file.

dev.off()

When we execute the above code, it produces the following result

−

pbinom()
This function gives the cumulative probability, that is, the probability
of at most x successes. It is a single value for each x.

# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)

print(x)

When we execute the above code, it produces the following result

−
[1] 0.610116

qbinom()
This function takes a probability value and gives the smallest number
of successes whose cumulative probability is at least that value.

# Number of heads whose cumulative probability reaches 0.25
# when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)

print(x)

When we execute the above code, it produces the following result

−
[1] 23

rbinom()
This function generates the required number of random values, each being
the number of successes in the given number of trials with the given probability.

# Generate 8 random values, each the number of successes in 150 trials
# with probability of success 0.4.
x <- rbinom(8,150,.4)

print(x)
When we execute the above code, it produces the following result
−
[1] 58 61 59 66 55 60 61 67

R - Poisson Regression

Poisson Regression involves regression models in which the

response variable is in the form of counts and not fractional
numbers. For example, the count of number of births or number
of wins in a football match series. Also the values of the response
variables follow a Poisson distribution.

The general mathematical equation for Poisson regression is −

log(y) = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −

 y is the response variable.

 a and b are the numeric coefficients.
 x is the predictor variable.

The function used to create the Poisson regression model is

the glm() function.
Syntax

The basic syntax for glm() function in Poisson regression is −

glm(formula,data,family)

Following is the description of the parameters used in the above
function −

 formula is the symbol presenting the relationship between
the variables.
 data is the data set giving the values of these variables.
 family is an R object to specify the details of the model. Its
value is poisson for Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the
effect of wool type (A or B) and tension (low, medium or high) on
the number of warp breaks per loom. Let's consider "breaks" as
the response variable which is a count of number of breaks. The
wool "type" and "tension" are taken as predictor variables.

Input Data

input <- warpbreaks
print(head(input))

When we execute the above code, it produces the following result

−
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L

Create Regression Model

output <-glm(formula = breaks ~ wool+tension, data =
warpbreaks,
family = poisson)
print(summary(output))

When we execute the above code, it produces the following result:

Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4

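In the summary, we look at the p-value in the last column,
Pr(>|z|); a value below 0.05 indicates that the predictor has a
statistically significant effect on the response variable. Here wool
type B and tension levels M and H all have p-values far below 0.05,
so each has an effect on the count of breaks relative to the baseline
of wool type A and tension L. Because the three estimates are
negative, each is associated with fewer expected breaks than the
baseline.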
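The estimates in the summary are on the log scale, so one way to read them is to exponentiate them. The sketch below refits the same model so that it runs on its own, converts the coefficients to multiplicative effects, and predicts the expected number of breaks for each combination of wool and tension. The names rate_ratios and new_data are introduced here for illustration.

```r
# Refit the model from the example so this block runs on its own
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# exp() of a coefficient gives the multiplicative change in the expected count
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))

# Expected number of breaks for every wool/tension combination
new_data <- expand.grid(wool = levels(warpbreaks$wool),
                        tension = levels(warpbreaks$tension))
new_data$expected_breaks <- predict(output, newdata = new_data,
                                    type = "response")
print(new_data)
```

Read this way, wool type B corresponds to roughly 0.81 times the expected breaks of wool type A at the same tension, and tension levels M and H to roughly 0.73 and 0.60 times the expected breaks at tension L.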
