TEB2043
Introduction to Data Science
Descriptive Analytics & Visualization
Dr Shuhaida Mohamed Shuhidan
JAN 2025
Credit: Ts Dr Nurul Aida Osman
Learning Outcomes
At the end of this session, you will be able to:
• Explain descriptive analytics
• Implement R code for descriptive statistics
• Implement R code for visualization
I. Descriptive Analytics
Descriptive Analytics
• The examination of data or content, usually manually performed, to answer the question “What
happened?” (or What is happening?), characterized by traditional business intelligence (BI) and
visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives. (gartner.com)
• The interpretation of historical data to better understand changes that have occurred in a business.
Descriptive analytics describes the use of a range of historic data to draw comparisons. Most commonly
reported financial metrics are a product of descriptive analytics, for example, year-over-year pricing
changes, month-over-month sales growth, the number of users, or the total revenue per subscriber. These
measures all describe what has occurred in a business during a set period. (investopedia.com)
Descriptive Statistics
• Data are described using summary measures, e.g., measures of central tendency and measures of variability or
dispersion.
• Central tendency - represents the whole set of data by a single value. It gives us the location of the centre
of the data.
o Mean
o Mode
o Median
• Variability/dispersion - the spread of the data, i.e., how widely the values are distributed.
o Range
o Variance
o Standard deviation
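As a quick illustration of these measures, the sketch below computes them with base R functions for a small made-up vector. The values in v are illustrative only and are not course data; they are chosen so the results can be checked by hand.

```r
# Toy vector (illustrative values only, not course data)
v <- c(2, 4, 4, 5, 10)

# Central tendency
v_mean   <- mean(v)    # (2 + 4 + 4 + 5 + 10) / 5 = 5
v_median <- median(v)  # middle value of the sorted data = 4

# Variability / dispersion
v_range <- range(v)    # smallest and largest values: 2 10
v_var   <- var(v)      # sample variance: 36 / (5 - 1) = 9
v_sd    <- sd(v)       # square root of the variance = 3

print(c(mean = v_mean, median = v_median, variance = v_var, sd = v_sd))
```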
Descriptive Statistics
• Load the built-in iris dataset (from the datasets package that ships with R)
data <- iris
• The head function shows the first rows of the data (6 by default)
head(data)
• The tail function shows the last rows
tail(data)
Descriptive Statistics AL2 Think Pair Share
(Quiz #2 - Part of 2%)
• The str function is used to describe the structure of the data
str(data)
Quiz #2
Define Factor and discuss the meaning of “Factor w/ 3 levels” in the output of str(data).
Descriptive Statistics
• The minimum and maximum values from the data can be determined using min and max functions
respectively.
min(data$Sepal.Length) #this produces 4.3
max(data$Sepal.Length) #this produces 7.9
• Alternatively, the minimum and maximum values can be determined together with the range function.
range(data$Sepal.Length) #this produces 4.3 7.9
• The result is a vector of two values, so either value can be extracted by index.
range(data$Sepal.Length)[1] #this produces 4.3
range(data$Sepal.Length)[2] #this produces 7.9
• Equivalently, the result can be stored in a variable first:
range_val <- range(data$Sepal.Length)
range_val[1]
range_val[2]
• Sometimes we want the range as a single value (maximum minus minimum); this can be computed with the min and max functions:
the_range <- max(data$Sepal.Length)-min(data$Sepal.Length)
the_range #this produces 3.6
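The same single-value range can also be obtained from the output of range. The sketch below uses the base R diff function, which is one possible alternative and not the only way to do it.

```r
data <- iris

# range() returns c(min, max); diff() subtracts the first value from the second
range_val <- range(data$Sepal.Length)  # 4.3 7.9
the_range <- diff(range_val)           # maximum minus minimum

print(the_range)                       # 3.6
```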
Descriptive Statistics AL3 In-Class Teams
(Quiz #2 - Part of 2%)
• The mean and median of data can be determined with the mean and median functions.
mean(data$Sepal.Length) #this produces 5.843333
median(data$Sepal.Length) #this produces 5.8
Quiz #2:
How about the mode? Investigate how to calculate the mode in R.
Descriptive Statistics
• Standard deviation and variance can be determined with the sd and var functions respectively.
sd(data$Sepal.Length) #this produces 0.8280661
var(data$Sepal.Length) #this produces 0.6856935
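• Variance (σ²) is the average of the squared differences from the mean. Note that var in R computes the sample variance, which divides by n - 1 rather than n.
• Standard deviation (σ) is the square root of the variance; sd in R is the square root of the sample variance.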
[Figure: normal distribution. Image source: mathisfun.com]
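To connect the definitions to the functions, the sketch below recomputes the variance and standard deviation of Sepal.Length by hand. The variable names (sample_var, pop_var) are chosen for illustration only. It shows that var and sd match the version that divides by n - 1, not the one that divides by n.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sum of squared differences from the mean
ss <- sum((x - mean(x))^2)

sample_var <- ss / (n - 1)  # what var() computes
pop_var    <- ss / n        # "average of squared differences" with divisor n

print(all.equal(sample_var, var(x)))       # TRUE
print(all.equal(sqrt(sample_var), sd(x)))  # TRUE
print(pop_var < sample_var)                # TRUE: dividing by n gives a slightly smaller value
```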
Descriptive Statistics
• We have seen how the summary function is used to generate useful descriptive statistics.
summary(data)
summary(data$Sepal.Length)
• The summary function can also be applied to each group separately using the by function:
by(data, data$Species, summary)
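When only one statistic per group is needed, base R offers more compact options than by. The sketch below uses tapply and aggregate to compute the mean Sepal.Length for each species; it is one possible approach, shown here for comparison.

```r
data <- iris

# tapply(values, groups, function): one result per group
means_by_species <- tapply(data$Sepal.Length, data$Species, mean)
print(means_by_species)  # setosa 5.006, versicolor 5.936, virginica 6.588

# aggregate() returns the same information as a data frame
agg <- aggregate(Sepal.Length ~ Species, data = data, FUN = mean)
print(agg)
```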
Descriptive Statistics
• Data can be divided into quartiles.
• First quartile (lower quartile) → the value that cuts off the first 25% of the data when it is sorted in
ascending order.
• Third quartile (upper quartile) → the value that cuts off the first 75% when it is sorted in ascending order.
• Assume that a vector A contains 9 values:
A<-c(170.2, 181.5, 188.9, 163.9, 166.4, 163.7, 160.4, 175.8, 181.5)
• Using the quantile function:
quantile(A)
• Let’s check the 1st and 3rd quartiles against the sorted values:
sort(A)
o 1st quartile: n=9, 0.25*9=2.25, round up to 3, and the 3rd sorted value is 163.9
o 3rd quartile: n=9, 0.75*9=6.75, round up to 7, and the 7th sorted value is 181.5
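The round-up check can be reproduced in code. The sketch below applies the round-up rule with ceiling and compares the result with quantile. For this vector the rule agrees with quantile's default method; it also corresponds to type = 1 in R's documentation, which is a different rule from the default, so the two need not agree for other data.

```r
A <- c(170.2, 181.5, 188.9, 163.9, 166.4, 163.7, 160.4, 175.8, 181.5)
sorted <- sort(A)
n <- length(A)

# Round-up rule: take the ceiling(p * n)-th sorted value
q1_manual <- sorted[ceiling(0.25 * n)]  # 3rd sorted value = 163.9
q3_manual <- sorted[ceiling(0.75 * n)]  # 7th sorted value = 181.5

# Compare with quantile(); unname() drops the "25%" / "75%" labels
print(q1_manual == unname(quantile(A, 0.25, type = 1)))       # TRUE
print(q3_manual == unname(quantile(A, 0.75, type = 1)))       # TRUE
print(all.equal(q1_manual, unname(quantile(A, 0.25))))        # TRUE for this vector
print(all.equal(q3_manual, unname(quantile(A, 0.75))))        # TRUE for this vector
```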
Descriptive Statistics
• The quantile function can be used for a specific quartile:
quantile(A,0.25)
quantile(A,0.75)
• Other quantiles (percentiles) can also be determined using the quantile function:
quantile(A,0.4)
quantile(A,0.8)
*Find out how R determines the results for the above code.
• There is a specific function known as IQR that calculates the interquartile range (i.e., the difference between
the 3rd and the 1st quartiles):
IQR(A)
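As a starting point for the investigation above, the sketch below reimplements the rule documented for quantile's default method (type = 7): linear interpolation between the two nearest sorted values. The helper q7 is a hypothetical name introduced here for illustration and is not part of R. The last lines check that IQR equals the difference between the 3rd and 1st quartiles.

```r
A <- c(170.2, 181.5, 188.9, 163.9, 166.4, 163.7, 160.4, 175.8, 181.5)

# Hypothetical helper (not part of R): interpolate between order statistics,
# following the rule documented for quantile()'s default type = 7
q7 <- function(x, p) {
  s  <- sort(x)
  h  <- 1 + (length(s) - 1) * p  # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

print(q7(A, 0.4))  # position 4.2: 166.4 + 0.2 * (170.2 - 166.4) = 167.16
print(q7(A, 0.8))  # position 7.4: both neighbours are 181.5, so 181.5

# Interquartile range = 3rd quartile - 1st quartile
iqr_manual <- q7(A, 0.75) - q7(A, 0.25)  # 181.5 - 163.9 = 17.6
print(all.equal(iqr_manual, IQR(A)))     # TRUE
```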
Descriptive Statistics
• Other commonly used descriptive statistics:
o Counting the number of rows
nrow(data)
nrow(data['Sepal.Length'])
o Counting the number of columns
ncol(data)
o Counting the number of NA
sum(is.na(data$Sepal.Length))
o Counting the number of negative values
sum(data$Sepal.Length<0)
o Counting the number of unique text-based values (non-numeric)
B<-c(rep("Yellow",2),rep("Red",3),rep("Yellow",3),rep("Black",3))
factor(B) #the Levels line lists the unique values: Black Red Yellow
nlevels(factor(B)) #this produces 3
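The counts above can also be obtained directly as numbers. The sketch below uses unique and table on the vector B, and colSums with is.na to count missing values in every column of iris at once. These are base R functions; the variable names are chosen for illustration.

```r
B <- c(rep("Yellow", 2), rep("Red", 3), rep("Yellow", 3), rep("Black", 3))

# Number of distinct values
n_unique <- length(unique(B))
print(n_unique)  # 3

# How many times each distinct value occurs
counts <- table(B)
print(counts)    # Black 3, Red 3, Yellow 5

# Number of NA values in each column of a data frame
na_per_column <- colSums(is.na(iris))
print(na_per_column)  # all zeros: iris has no missing values
```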
II. Visualization
Scatter Plot
• Other than iris, there are a number of other built-in datasets provided by R:
data()
• Import mtcars data:
data <- mtcars
• View the mtcars data:
data
• To draw a scatter plot, define the x and y values and pass them to the plot function:
x<- -10:10
y<-x*x
plot(x,y,xlab='x',ylab='y',col='red')
Scatter Plot
• Let’s plot mpg (y) vs wt (x)
x<-data$wt
y<-data$mpg
plot(x,y,xlab='wt',ylab='mpg',col='green')
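The downward pattern in this scatter plot can be summarised with a single descriptive number. The sketch below computes the Pearson correlation between wt and mpg with the base R cor function and shows it in the plot title; adding the title is an illustrative choice, not part of the original slide.

```r
data <- mtcars

# Pearson correlation between weight and fuel economy
r <- cor(data$wt, data$mpg)
print(round(r, 3))  # -0.868: heavier cars tend to have lower mpg

# Same scatter plot as above, with the correlation in the title
plot(data$wt, data$mpg, xlab = 'wt', ylab = 'mpg', col = 'green',
     main = paste("mpg vs wt, r =", round(r, 2)))
```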
Histogram
hist(data$mpg,col="green")
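The hist function also returns the bins it used, which can be inspected after plotting. In the sketch below, breaks = 10 is only a suggested number of bins (R may choose a nearby "pretty" number), and the title and axis label are illustrative additions.

```r
data <- mtcars

# Draw the histogram and keep the returned object
h <- hist(data$mpg, breaks = 10, col = "green",
          main = "Histogram of mpg", xlab = "mpg")

print(h$breaks)       # bin boundaries chosen by R
print(h$counts)       # number of cars in each bin
print(sum(h$counts))  # 32: every car falls in exactly one bin
```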
Bar Chart
val=data$mpg
carnames=row.names(data)
barplot(val,ylab='mpg',main="Car - MPG",names.arg= carnames,
cex.names=0.6,las=2,col="blue")
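A bar chart is often easier to read when the bars are sorted. The sketch below is a variation of the code above that reorders the cars by mpg with order before plotting; the sorting step and the changed title are additions for illustration.

```r
data <- mtcars

# Row positions sorted from highest to lowest mpg
ord <- order(data$mpg, decreasing = TRUE)

val      <- data$mpg[ord]
carnames <- row.names(data)[ord]

barplot(val, ylab = 'mpg', main = "Car - MPG (sorted)", names.arg = carnames,
        cex.names = 0.6, las = 2, col = "blue")

print(carnames[1])  # "Toyota Corolla", the car with the highest mpg
```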
More on Scatter Plot
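#the path below is specific to the author's machine; change it to the location of your copy of covid_my.csv
my<-read.csv("C:/Users/nurulaida/OneDrive - Universiti Teknologi PETRONAS/R/IDS/covid_my.csv")
my
x<-1:15 #one index per row; assumes the file has 15 rows
y<-my$Confirmed
plot(x,y,pch=16,col='blue',ylab='Confirmed case',main="Covid-19 Confirmed Cases in Malaysia")
text(x,y,labels=my$State,pos=4,cex=0.5)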
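The same pattern can be tried without access to covid_my.csv. The sketch below writes a tiny made-up file to a temporary folder and reads it back. The three rows are invented placeholder values (not real Covid-19 figures), and only the column names State and Confirmed follow the slide. It also builds x from the number of rows instead of hard-coding it, so x and y always have the same length.

```r
# Toy data with invented values (NOT real figures), same column names as the slide
toy <- data.frame(State = c("A", "B", "C"), Confirmed = c(120, 45, 300))

# Write to a temporary location so the example runs on any machine
csv_path <- file.path(tempdir(), "toy_cases.csv")
write.csv(toy, csv_path, row.names = FALSE)

my_toy <- read.csv(csv_path)

x <- seq_len(nrow(my_toy))  # 1, 2, ..., number of rows
y <- my_toy$Confirmed

plot(x, y, pch = 16, col = 'blue', xlab = 'Row index', ylab = 'Confirmed case',
     main = "Toy example")
text(x, y, labels = my_toy$State, pos = 4, cex = 0.8)
```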
More on Bar Chart
val=my$Deaths
name_st=my$State
barplot(val,ylab='Deaths',main="Covid-19 Deaths in Malaysia",names.arg=
name_st, cex.names=0.6,las=2,col="orange")
Pie Chart
lbl=my$State
val2=my$Confirmed
pie(val2,lbl,cex=0.5)
3D Pie Chart
library(plotrix)
val2=my$Confirmed
lbl=my$State
pie3D(val2,
col = hcl.colors(length(val2), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5)
Exploded 3D Pie Chart
library(plotrix)
val2=my$Confirmed
lbl=my$State
pie3D(val2,
col = hcl.colors(length(val2), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5,explode=0.2)
Exploded 3D Pie Chart
val2=my[my$Confirmed>300000,]
val3=val2$Confirmed
lbl=paste(val2$State,val2$Confirmed,sep=",")
pie3D(val3,
col = hcl.colors(length(val3), "Spectral"),
border = "white",
labels=lbl,labelcex=0.5,explode=0.2)
Case Study 1 – Stacked Bar Chart AL5 Reflection
(Quiz #2 - Part of 2%)
val2=my[my$Confirmed>300000,]
tbl1=val2[c("Confirmed","Population")]
legendval=c("Confirmed","Population")
colors=c("green","orange")
row.names(tbl1)=val2$State
tbl2=t(tbl1)
options(scipen=999)
barplot(as.matrix(tbl2),col=colors,cex.names=0.8,las=2, cex.axis=0.8)
legend("topright", legendval, cex=0.8, fill=colors)
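The transpose step is what makes the bars stack: barplot draws one bar per matrix column and stacks the rows within it. The sketch below shows this on a tiny made-up table. The values and state names are invented for illustration (not real data), and only the column names Confirmed and Population follow the slide.

```r
# Toy table with invented values (NOT real data): one row per state
toy <- data.frame(Confirmed = c(10, 20, 15), Population = c(100, 150, 80))
row.names(toy) <- c("State A", "State B", "State C")

# After t(): one column per state, one row per measure
m <- t(toy)
print(dim(m))  # 2 3

# Each column becomes one bar; the two rows are stacked inside it
barplot(m, col = c("green", "orange"), las = 2, legend.text = rownames(m))

# Total height of each stacked bar
bar_heights <- colSums(m)
print(bar_heights)  # 110 170 95
```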
Case Study 2 – Geomap AL5 Reflection
(Quiz #2 - Part of 2%)
library(rworldmap) #to get a Malaysia map
library(tidyverse)
library(tidygeocoder)
mydat<-read.csv("C:/Users/nurulaida.osman/OneDrive - Universiti Teknologi PETRONAS/R/IDS/covid_my.csv")
global <- map_data("world") #get map
ggplot() +
  geom_polygon(data = global %>% filter(region == "Malaysia"),
               aes(x = long, y = lat, group = group),
               fill = "lightskyblue1") +
  coord_fixed(1.3) +
  geom_point(data = mydat, aes(x = Long, y = Lat), color = "red") +
  geom_text(data = mydat, label = paste(mydat$State, mydat$Confirmed, sep = ","),
            aes(x = Long, y = Lat),
            nudge_x = 0.25, nudge_y = 0.25,
            color = "black", size = 1.5) +
  theme_void()
Summary
You have learned…
✓ Descriptive analytics
✓ Descriptive statistics that are important for descriptive analytics (and data preparation)
✓ Visualization
More variations of visualization can be produced using…
➢ ggplot2
➢ plotly
➢ leaflet
Next…
❖ More on data cleaning
❖ Feature settings