R1 Uptovisualisation
Prepared By:
Nur-E-Faeeza Ankhi
Lecturer, DURP, BUET
Acknowledgement:
Niaz Mahmud Zafri
Assistant Professor, DURP, BUET
Analyzing Data with R
Plan 298: Data Analytics
R Download, Install and Opening
Go to the link: https://posit.co/download/rstudio-desktop/
✓ 1st download and install latest version of R base
✓ Then download and install latest version of R Studio Desktop
(RStudio can be downloaded for free and R base is required to use RStudio)
RStudio panes (figure): Source, Console, Environment, and Output. The Source pane can work as a calculator; the Console also shows if there is any error.
Database Preparation
To check:
Structure of the database use str() function: str(df) [shows (observation, variables), variable types]
Dimension of the database use dim() function: dim(df) [shows (observation, variables)]
To open the database use View() function: View(df)
To print the database in console use print() function or type the database name only: print(df) or df
To check top n observations use head() function: head(df, n=20) [head(df,20)]
To check bottom n observations use tail() function: tail(df, n=25)
To access a particular variable among the dataset use $ sign: df$name
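The checks listed above can be tried on any data frame. The sketch below assumes R's built-in mtcars dataset stands in for your own database; the name df is only illustrative.

```r
# Assumption: the built-in mtcars data stands in for your own database
df <- mtcars

str(df)            # structure: observations, variables, and variable types
dim(df)            # number of observations (rows) and variables (columns)
head(df, n = 5)    # top 5 observations
tail(df, n = 5)    # bottom 5 observations
df$mpg             # access one variable with the $ sign
```

View(df) is left out of the sketch because it opens a separate viewer window and only makes sense in an interactive RStudio session.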
Do Practice:
Gender | Age | Age_Group | Residence | Zone Type | Education | Occu | Occupation | Mon_per_income
Female | 19 to 30 | Young | Daskhin Khan | Less Sensitive Zone | SSC/HSC | Student | Student | 5,000-19,999
Male | 46 to 65 | Old | Demra | Less Sensitive Zone | Graduation | Business | Employed | 1,00,000 and above
Male | 46 to 65 | Old | Jatrabari | High Sensitive Zone | SSC/HSC | Labor/Daily Job | Employed | 5,000-19,999
Male | 31 to 45 | Middle-aged | Sabujbagh | High Sensitive Zone | SSC/HSC | Private Service | Employed | 5,000-19,999
Female | 19 to 30 | Young | Shyampur | High Sensitive Zone | Graduation | Others | Employed | 5,000-19,999
Female | 19 to 30 | Young | Mohammadpur | High Sensitive Zone | Graduation | Student | Student | 20,000-39,999
Male | 31 to 45 | Middle-aged | Lalbagh | High Sensitive Zone | Graduation | Student | Student | 20,000-39,999
Male | 46 to 65 | Old | Rampura | High Sensitive Zone | More than graduation | Private Service | Employed | 60,000-99,999
Male | 19 to 30 | Young | Rampura | High Sensitive Zone | SSC/HSC | Student | Student | 5,000-19,999
Database Preparation
Do Practice:
Solution:
##Create database
#Create ID variable: Character
ID<-c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
ID
class(ID)
length(ID)
#Create Name variable: Character
Name<-c("Firoz", "Rafiq", "Fatema", "Akter", "Nafi", "Sumon", "sadia",
"Sommo", "Sirin", "parvin")
Name
class(Name)
length(Name)
#Create Gender variable: Factor
Gender<-factor(c("Male", "Male", "Female", "Female", "Male",
"Male","Female", "Male", "Female", "Female"))
Gender
class(Gender)
length(Gender)
#Create Age variable: whole numbers (class() returns "numeric"; write 28L, 33L, ... for integer)
Age<-c(28, 33, 43, 55, 67, 21, 32, 43, 65, 45)
Age
class(Age)
length(Age)
#Create Weight variable: Numeric
Weight<-c(77.6, 56.7, 60.9, 55.9, 72.6, 68.4, 66.2, 54.7, 75.3, 71.4)
Weight
class(Weight)
length(Weight)
#Create Health Condition variable: Factor
Health_condition<-factor(c("Very Good", "Bad","Moderate","Good",
"Good", "Very Bad", "Bad", "Moderate", "Very Good", "Good"))
Health_condition
class(Health_condition)
length(Health_condition)
#Create database
database1<-data.frame(ID, Name, Gender, Age, Weight, Health_condition)
database1
str(database1)
dim(database1)
head(database1, 5)
tail(database1, 5)
View(database1)
#Create a database by using ID, Age, Weight, Health_condition of "database1" dataframe
database2<-data.frame(database1$ID, database1$Age, database1$Weight, database1$Health_condition)
database2
str(database2)
dim(database2)
head(database2, 5)
tail(database2, 5)
View(database2)
Database Preparation
Spreadsheet-like view
First, create a blank database using the data.frame() function and assign it a name.
Use the edit() or fix() function to create variables and enter data.
If you use the edit() function, you have to use the assignment operator to save the result back to the database name.
To rename a variable, you can use these same functions.
In summary, fix() modifies the original object directly, while edit() returns a modified copy that needs to be reassigned.
Example - R script:
travel_database <-data.frame()
travel_database <- edit(travel_database)
travel_database
edit(travel_database)
travel_database
fix(travel_database)
travel_database
Matrix format
The numeric values in a data frame might alternatively be stored in a matrix with the same dimensions, e.g., 5 rows × 2 columns.
In a matrix, values are stored and accessed by row and column number rather than through variables with the $ sign.
Data frame to matrix conversion:
Structure: new_matrix_name <- as.matrix(data_frame_name)
Example: DF_Matrix <- as.matrix(travel_database)
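As a minimal sketch of the conversion, the example below builds a hypothetical 5 × 2 numeric data frame (travel_example and its columns are invented for illustration) and converts it with as.matrix().

```r
# Hypothetical 5 x 2 numeric data frame (values invented for illustration)
travel_example <- data.frame(trips = c(2, 4, 3, 5, 1),
                             cost  = c(40, 85, 60, 120, 25))

# Convert the data frame to a matrix with the same dimensions
DF_Matrix <- as.matrix(travel_example)

dim(DF_Matrix)        # 5 rows, 2 columns
is.matrix(DF_Matrix)  # TRUE
DF_Matrix[3, 2]       # value stored in row 3, column 2
```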
Database: Export/Save
First, you need to set directory where you want to save the file. It can be done through setwd()
function or manually using the steps:
Session→ Set working directory→ Choose directory→ Select appropriate directory.
To know the existing working directory use getwd() function. Also see: file.choose() function
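The working-directory steps can be sketched as follows. The slide does not name an export function here, so write.csv() and read.csv() are used as one common base R option; the temporary folder and the file name example_db.csv are assumptions for illustration only.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# Assumption: a temporary folder stands in for the directory you would choose
setwd(tempdir())

# Small illustrative database
example_db <- data.frame(ID = 1:3, Age = c(28, 33, 43))

# Save the database as a CSV file in the working directory
write.csv(example_db, "example_db.csv", row.names = FALSE)
file.exists("example_db.csv")

# Read it back to confirm that the export worked
reloaded <- read.csv("example_db.csv")

# Restore the original working directory
setwd(old_dir)
```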
Structure:
install.packages("Package name")
library(Package name)
Function()
To get details about a package, use the help tool. Structure: ?Package name
Example: ?ggplot2
2. Recode
Character Variable
Automatically recode using as.factor() function (alphabetical order)
travel_data$gender <- as.factor(travel_data$gender)
To recode according to your needs, use the factor() function and supply levels and labels within the function
travel_data$gender <- factor(travel_data$gender, levels = c("Male", "Female"), labels = c("Man", "Woman"))
If the variable is an ordinal variable, use the factor() function and supply levels, labels, and ordered within the function
travel_database$Quality<-factor(travel_database$Quality, levels = c(1, 2, 3),
labels = c("bad","moderate","good"), ordered = T)
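The three recoding options can be compared side by side on small vectors. The sketch below uses invented values rather than the travel data, so the vector names are illustrative.

```r
# Invented character variable
gender <- c("Male", "Female", "Female", "Male")

# Automatic recode: levels are set in alphabetical order
gender_auto <- as.factor(gender)
levels(gender_auto)

# Manual recode: choose the level order and attach new labels
gender_recode <- factor(gender, levels = c("Male", "Female"),
                        labels = c("Man", "Woman"))
levels(gender_recode)

# Ordinal recode: ordered = TRUE makes the levels comparable
quality <- c(1, 3, 2, 3, 1)
quality_f <- factor(quality, levels = c(1, 2, 3),
                    labels = c("bad", "moderate", "good"), ordered = TRUE)
quality_f[2] > quality_f[1]   # "good" ranks above "bad"
```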
2. Recode
Numeric Variable
To recode according to your needs, use the mutate() function of the dplyr package together with the ifelse() conditional function.
Structure: new_database <- mutate(existing_database, new_variable = ifelse(condition, value if TRUE, value if FALSE))
#recode-------------------
#Package install
install.packages("dplyr")
library(dplyr)
#create variable-----------------------------------
var1<-c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F", "F", "M")
#Database creation---------------------------------
data<-data.frame(var1, var2)
data
#Recode var1--------------------------------------
#if var1<10 then code it good, otherwise bad
data1<-mutate(data, recode_var1= ifelse(var1<10, "good", "bad"))
data1
#if var1<7 then code it good, var1 is in between 7-14 then moderate, otherwise bad
data2<-mutate(data, recode_var2= ifelse(var1<7, "good", ifelse(var1>=7 & var1<14, "moderate", "bad")))
data2
#Recode var2---------------------------------------
#if var2 is M/F then Human, otherwise animal
data3<-mutate(data, recode_var3= ifelse(var2=="F"|var2=="M", "human", "animal"))
data3
3. Transform Variable
The mutate() function of dplyr package allows you to create new variables or
transform existing ones.
#Transform variable------------------------
#Package install
install.packages("dplyr")
library(dplyr)
#Variable creation-------------------------
var0<-1:10
var1<-c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F", "F", "M")
var3<-c("X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X")
var4<-c(1.5, 5.5, 4.9,1.9, 8.5, 3.3, 4.5, 5.9, 8.1, 3.7)
#Dataset preparation-----------------------
data1<-data.frame(var0, var1, var2, var3, var4)
data1
#Create a new variable:var1*5---------------
data2<-mutate(data1,var5=var1*5)
data2
#Create two new variables: var4/5, var1+var4------
data3<-mutate(data1,var6=var4/5, var7=var1+var4)
data3
#Create three variables: log10(var0), var1^2, sqrt(var4)------
data4<-mutate(data1,var8=log10(var0), var9=var1^2, var10=sqrt(var4))
data4
4. Filter Columns
The select() function of the dplyr package allows you to limit your dataset to specified variables (columns).
keep the variables name, height, and gender of ‘starwars’ dataset
newdata <- select(starwars, name, height, gender)
keep the variables name and all variables between mass and species inclusive of ‘starwars’ dataset
newdata <- select(starwars, name, mass:species)
keep all variables except birth_year and gender of ‘starwars’ dataset
newdata <- select(starwars, -birth_year, -gender)
5. Filter Row
The filter() function of the dplyr package allows you to limit your dataset to observations (rows) meeting specific criteria.
Multiple criteria can be combined with the & (AND) and | (OR) symbols
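A minimal sketch of row filtering is shown below. The people data frame and its values are invented for illustration, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Invented example data
people <- data.frame(name   = c("A", "B", "C", "D"),
                     gender = c("Female", "Male", "Female", "Male"),
                     age    = c(25, 41, 37, 19))

# AND: keep rows where both conditions are true
women_over_30 <- filter(people, gender == "Female" & age > 30)
women_over_30

# OR: keep rows where at least one condition is true
young_or_male <- filter(people, age < 20 | gender == "Male")
young_or_male
```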
To sort a variable, use View() function. Then, click on the icon near variable name
View(data1)
6. Merge Dataset
Merging datasets is commonly required when data on single units are stored in multiple tables or datasets.
We consider a simple example where variables id, year, female, and inc are available in one dataset, and variables id and
maxval in a second.
data3 <- merge(data1, data2, by="id", all=TRUE)
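The merge described above can be tried with two small tables. The values below are invented; only the variable names (id, year, female, inc, maxval) come from the slide.

```r
# First dataset: one row per id and year (invented values)
data1 <- data.frame(id     = c(1, 1, 2, 3),
                    year   = c(2020, 2021, 2020, 2020),
                    female = c(1, 1, 0, 1),
                    inc    = c(100, 110, 90, 120))

# Second dataset: one row per id (invented values)
data2 <- data.frame(id     = c(1, 2, 4),
                    maxval = c(50, 60, 70))

# all = TRUE keeps ids found in either dataset; unmatched cells become NA
data3 <- merge(data1, data2, by = "id", all = TRUE)
data3
```

With all = TRUE, id 3 (missing from data2) gets NA for maxval and id 4 (missing from data1) gets NA for year, female, and inc. Dropping all = TRUE would keep only ids present in both tables.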
Descriptive Statistics
✓ Descriptive statistics summarize the main features of a dataset. They provide a way to
understand the basic characteristics of data through measures of central tendency, variability
(spread), and distribution shape
✓ Does not draw conclusions or make inferences about a population
1. Percentile Values
2. Central Tendency
i. Mean
ii. Median
iii. Mode
iv. Sum
3. Dispersion
i. Standard deviation
ii. Variance
iii. Maximum
iv. Minimum
v. Range
4. Distribution
i. Skewness
ii. kurtosis
1. Percentile Values
Numerical data can be sorted in increasing or decreasing order. Thus the values of a numerical data set have a rank
order. A percentile is the value at a particular rank.
Quartiles
The quartiles divide the data into four divisions of 25% each, so:
Quartile 1 (Q1) can be called the 25th percentile
Quartile 2 (Q2) can be called the 50th percentile
Quartile 3 (Q3) can be called the 75th percentile
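Quartiles and other percentiles can be computed with the base R quantile() function. The vector below is invented for illustration; note that quantile() offers several calculation types, so other software may give slightly different values for small samples.

```r
# Invented numeric data
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)

# Q1, Q2 (median), and Q3 as the 25th, 50th, and 75th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
quartiles

# Any other percentile works the same way, e.g. the 90th
quantile(x, probs = 0.90)
```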
Median
The "median" is the "middle" value in the list of numbers. To find the median, your numbers have to be
listed in numerical order from smallest to largest, so you may have to rewrite your list before you
can find the median.
Mode
The "mode" is the value that occurs most often. If no number in the list is repeated, then there is no mode
for the list.
Sum/ Total
3. Dispersion
Standard deviation: indicates that, on average, the data points deviate from the mean by approximately the value of the standard deviation (in the units of the data)
✓ A summary measure of the differences of each observation from the mean
✓ Small Std. Dev. means more values near the mean and less spread of the histogram
Variance: Square of Std. Dev.
Maximum: Maximum value of the data set
Minimum: Minimum value of the data set
Range: Maximum-Minimum
[Figure: histogram annotated with the mean, an individual deviation from the mean, the average deviation from the mean (= Std. dev.), the minimum, the maximum, and the range]
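The central tendency and dispersion measures can be computed with base R functions. The sketch below uses an invented vector; base R has no function for the statistical mode, so stat_mode() is a hypothetical helper written for this example.

```r
# Invented numeric data
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)

# Central tendency
mean(x)
median(x)
sum(x)

# Hypothetical helper: returns the most frequent value(s).
# If every value occurs once, it returns all of them, i.e. there is no single mode.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)

# Dispersion
sd(x)              # standard deviation
var(x)             # variance = square of the standard deviation
max(x)
min(x)
max(x) - min(x)    # range as a single number; range(x) returns min and max
```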
4. Distribution
Skewness:
…measures the asymmetry of the data distribution around its mean
✓ If skewness is close to zero, the data are spread roughly symmetrically around the mean, as in a normal distribution (values within -1 to +1 are commonly treated as acceptable)
✓ A positive skewness indicates a longer tail on the right, with more values concentrated on the left (skewness > 1 = right-skewed, positively skewed)
✓ A negative skewness indicates a longer tail on the left, with more values concentrated on the right (skewness < -1 = left-skewed, negatively skewed)
Necessity
✓ Many statistical methods (e.g., regression, t-tests) assume that the data is normally distributed. Skewness can
indicate whether this assumption is valid
✓ Identifying outliers
✓ Skewness provides insight into the underlying patterns of the data. For example, if the data represents income
distribution, a positive skew might indicate that a few individuals earn significantly more than the rest
✓ In fields such as finance, environmental studies, and transportation, data can often be skewed. Understanding this
skewness can improve model selection and predictive accuracy
4. Distribution
Skewness: https://www.youtube.com/watch?v=XSSRrVMOqlQ&t=57s
4. Distribution
Kurtosis:
…measures the tailedness and peak shape of a data distribution. It helps identify whether a dataset has more or fewer extreme values than expected in a normal distribution. The thresholds below refer to excess kurtosis (kurtosis minus 3), for which a normal distribution scores 0.
✓ Kurtosis > 0: Leptokurtic (positive value)
✓ Kurtosis = 0: Normal (mesokurtic)
✓ Kurtosis < 0: Platykurtic (negative value)
Necessity
✓ Many statistical methods assume a normal
distribution (mesokurtic). Knowing the
kurtosis helps assess whether this
assumption is valid
✓ High kurtosis can indicate the presence of
extreme outliers, which may need to be
addressed in data analysis or modeling.
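To make the two distribution measures concrete, the sketch below computes moment-based skewness and excess kurtosis by hand. The function names and data are invented for illustration, and packaged functions may apply small-sample corrections, so their results can differ slightly from these.

```r
# Moment-based skewness: third central moment / (second central moment)^(3/2)
skewness_moment <- function(x) {
  m <- mean(x)
  n <- length(x)
  (sum((x - m)^3) / n) / (sum((x - m)^2) / n)^(3 / 2)
}

# Excess kurtosis: fourth central moment / (second central moment)^2, minus 3
excess_kurtosis_moment <- function(x) {
  m <- mean(x)
  n <- length(x)
  (sum((x - m)^4) / n) / (sum((x - m)^2) / n)^2 - 3
}

symmetric  <- c(1, 2, 3, 4, 5)        # evenly spread around the mean
right_tail <- c(1, 1, 2, 2, 3, 15)    # one large value stretches the right tail

skewness_moment(symmetric)            # zero for perfectly symmetric data
skewness_moment(right_tail)           # positive: right-skewed
excess_kurtosis_moment(symmetric)     # negative: flatter than normal (platykurtic)
```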
Built-in Functions for Descriptive Statistics
Use built-in functions such as mean(), median(), sum(), sd(), var(), max(), min(), range(), and quantile() to calculate the required descriptive statistics
Also check fivenum() for finding Min, Q1, Median, Q3, Max
install.packages("psych")
library(psych)
describe(Pedestrian) ## you can use other functions too for summary
desc <- describe(Pedestrian)
Suppose we have two variables named A and B. Then the R script will be:
##Create table: A will be rows, B will be columns
mytable <- table(A,B)
##A frequencies (summed over B)
margin.table(mytable, 1)
##B frequencies (summed over A)
margin.table(mytable, 2)
##Total percentages
prop.table(mytable)
##Row percentage
prop.table(mytable, 1)
##Column percentage
prop.table(mytable, 2)
##Same steps with the Pedestrian data
##Create table: Gender will be rows, Pattern will be columns
mytable <- table(Pedestrian$Gender,Pedestrian$Pattern)
mytable
##Gender frequencies (summed over pattern)
margin.table(mytable, 1)
##Pattern frequencies (summed over gender)
margin.table(mytable, 2)
##Total percentages
table1<-prop.table(mytable)
round(table1,2)
##Row percentage
prop.table(mytable, 1)
##Column percentage
prop.table(mytable, 2)
8+1298/137*100
✓ To create a 2D graph, you only need to enter the x and y values.
✓ If you do not enter a value for an element, the system will use its default value.
✓ For a 3D/4D/5D graph, the elements by which you differentiate the data need to be placed in the aes() function.
✓ The others remain within the geom function.
ggplot: Define Data
-- Need to use the ggplot() function to define data and the aes() function to build up the structure of the graph
Structure of ggplot() function: ggplot(data, aes())
[Figure: structure of the aes() function. Example annotations: x axis defined by aes(x = Cylinders); y axis defined by aes(y = Miles per Gallon (mpg)); further aesthetics include color, fill, shape, and alpha (transparency of the fill).]
Graph: X and Y Axis
To reorganize the values or labels that appear on the x and y axes, several functions are used as follows:
1. For numeric variables
scale_x_continuous()
scale_y_continuous()
Structure:
scale_x_continuous(breaks = seq(1, 7, 1), limits = c(1, 7), labels = scales::percent)
theme_gray()
theme_bw()
theme_linedraw()
theme_light()
theme_dark()
theme_minimal()
theme_classic()
theme_void()
theme_test()
Structure:
theme(text = element_text(color = "navy",
family="Comic Sans MS", size=16),
panel.background = element_rect(fill = "white"),
panel.grid.major.y = element_line(color = "grey"),
panel.grid.minor.y = element_line(color = "grey",
linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
strip.background = element_rect(fill = "white",
color="grey"))
X and Y Axis: Theme
To change the text style and alignment of the x- and y-axis labels, use the following functions and structures:
For x-axis:
theme(axis.text.x =
element_text(face="bold", angle = 0, color=
"black", hjust = 1))
For y-axis:
theme(axis.text.y =
element_text(face="bold.italic", angle = 90,
size= 12, color= "black", hjust = 1))
Here:
✓ face = font style (bold/italic)
✓ angle = rotation angle of the text
✓ size = font size
✓ color = text color
✓ hjust = horizontal justification of the text (0 = left, 0.5 = centre, 1 = right)
Best fit line: geom_smooth()
To add a best-fit line, use the geom_smooth() function.
Use se = F to hide the shaded confidence band.
The shaded area represents the confidence interval, indicating the range within which we are fairly certain the true
relationship lies. It shows the level of uncertainty in the prediction; a wider area indicates more uncertainty.
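As a minimal sketch of the effect of the se argument, the example below draws the same scatter plot twice with the built-in mtcars data, assuming the ggplot2 package is installed. The choice of method = "lm" (a straight line) and the object names are assumptions for illustration.

```r
library(ggplot2)

# Scatter plot with a fitted line and the shaded confidence band (se = TRUE)
p_with_band <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# Same plot with the confidence band hidden (se = FALSE)
p_no_band <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Print either object to draw it, e.g. print(p_with_band)
```

Replacing method = "lm" with method = "loess" gives a flexible curve instead of a straight line.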
Grouping: facet_wrap()
In faceting, a graph consists of several separate plots, one for each level of a variable, or combination
of variables.
You can control the number of rows and columns by using the nrow and ncol arguments.
facet_wrap(~var1, ncol=2, nrow=2)
If you want to create multiple plots based on var1 and var2, then you need to use:
facet_grid(var1~var2)
Grouping:
facet_wrap()
1. facet_wrap(~carb)
2. facet_wrap(~carb, ncol=2, nrow=3)
3. facet_grid(vs~carb)
Plot Type- 1D, 2D, 3D, 4D…..
You just need to enter the x and y values in the aes() function to create a 2D graph.
For a 2D graph, enter the other element values (e.g., color, fill, size) in the geom function, e.g., geom_point().
If you do not enter a value for an element, the system will use its default value.
To incorporate a 3D, 4D, or 5D view, place the additional elements (e.g., color, fill, size) in the aes() function, mapped to the variables by which you differentiate the data. The others remain within the geom function.
Using the facet_wrap() function, you can also create a multi-dimensional graph.
Plot - 1D, 2D, 3D, 4D…..
1. 2D
2. 3D
3. 4D
4. 6D
Types of Graphs
"The simple graph has brought more information to the data analyst’s mind than any other device." (John Tukey)
This chapter will teach you how to visualise your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.
1. Univariate graphs:
…plot the distribution of data from a single variable. The variable can be categorical (e.g., race, sex) or quantitative (e.g., age, weight).
   1. Categorical: Pie chart, Bar chart
   2. Continuous: Histogram, Density plot, Box plot
2. Bivariate graphs:
…display the relationship between two variables. The type of graph will depend on the measurement level of the variables (categorical or quantitative).
   1. Categorical vs. Categorical: Pie-Donut chart, Bar chart (stacked, grouped, segmented), Tiles plot
   2. Quantitative vs. Quantitative: Scatter plot, Line plot
   3. Categorical vs. Quantitative: Grouped density plot, Grouped histogram, Box plot
3. Multivariate graphs:
…display the relationships among three or more variables. There are two common methods for accommodating multiple variables: grouping and faceting.
Quantitative vs. Quantitative Scatter and Line Plot
Scatter plot
Basic: ggplot(data, aes(x=Var1, y=Var2))+geom_point()
With other changes:
ggplot(data, aes(x= var1, y= var2, color= varX, alpha= varX, size= varX, shape= varX, stroke= varX))+
geom_point(color= "blue", alpha= 1, size= 1, shape= 1, stroke= 1)+
scale_x_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::comma)+
scale_y_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::percent)+
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class",
shape = "Year")+
theme_gray()+
theme(axis.text.x = element_text(face="bold.italic", angle = 0, size=12, color= "black", hjust = 1))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, size= 12, color= "black", hjust = 1))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
geom_smooth(method="loess", se=T)+
facet_wrap(~varX, ncol=2, nrow=2)
install.packages("ggplot2")
library(ggplot2)
ggplot(D, aes(x=HappinesIndex, y=GDP_index, color=continent, shape = Region))+
geom_point()+ scale_x_continuous(breaks = seq(2, 8, 2), limits=c(2, 8), label = scales::comma)+
scale_y_continuous(breaks = seq(6, 11, 1), limits=c(6, 11))+
labs(title="GDP vs Happiness", subtitle= "a global study",
caption="Source: author, 2024",
x=" Index value of Happiness",
y= "Index value of GDP", color= "Names of Continent", shape= "Region")+ theme_bw()+
theme(axis.text.x = element_text(face="bold.italic", angle = 45))+
theme(axis.text.y = element_text(face="bold.italic", angle = 30, size= 18, color= "blue", hjust = .0001))+
geom_smooth(method="loess", se=F)+
facet_wrap(~continent, ncol=2, nrow=3)
Line plot
Basic: ggplot(data, aes(x= var1, y= var2))+ geom_line()
Extended structure:
ggplot(data, aes(x= var1, y= var2, color= varX, alpha= varX, size= varX, shape= varX, linetype= varX))+
geom_line(color= "blue", alpha= 1, size= 1, shape= 1, linetype= 1)+
scale_x_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::comma)+
scale_y_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::comma)+
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class",
shape = "Year")+
theme_gray()+
theme(axis.text.x = element_text(face="bold.italic", angle = 0, size=12, color= "black", hjust = 1))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, size= 12, color= "black", hjust = 1))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
geom_smooth(method="loess", se=T)+
facet_wrap(~varX, ncol=2, nrow=2)
install.packages("ggplot2")
library(ggplot2)
ggplot(D, aes(x=year, y=HappinesIndex, color=continent, linetype = Region))+
geom_line()+
scale_x_continuous(breaks = seq(2005, 2020, 5), limits=c(2005, 2020))+
scale_y_continuous(breaks = seq(2,8,2), limits=c(2, 8))+
labs(title="Happiness Index by Year", subtitle= "a global study",
caption="Source: author, 2024",
x="Year", y= "Index value of Happiness",
color= "Names of Continent", linetype= "Region")+
theme_bw()+
geom_smooth(method="loess", se=T)+
facet_wrap(~continent, ncol=2, nrow=3)
library(ggplot2)
ggplot(D1, aes(x=year, y=HappinesIndex,
color=continent, linetype = Region))+
geom_line()+geom_point()
Categorical; Categorical vs. Categorical Pie Chart 1D – single variable
✓ A pie chart, sometimes called a circle chart, is a way of summarizing a set of nominal/ categorical data or
displaying the different values of a given variable (e.g. percentage distribution).
General Structure:
library(ggplot2)
library(webr)
library(dplyr)
mydata2<-as.data.frame(table(mydata$Variable))
mydata2
PieDonut(mydata2,aes(Var1, count=Freq),
labelposition=0,
explode = c(2, 5),
r0 = 0,
showPieName=F)
Practice:
library(ggplot2)
library(webr)
library(dplyr)
table<-table(Pedestrian$Intersection)
table
mydata2<-as.data.frame(table(Pedestrian$Intersection))
mydata2
##or
m1<-data.frame(table(Pedestrian$Intersection))
mydata2$Var1 <- gsub("Shapla Chattar", "Shapla\nChattar", mydata2$Var1)
PieDonut(mydata2, aes(Var1, count=Freq), labelposition=0, explode = c(2, 5,6), r0 = .5, showPieName=TRUE)
Categorical; Categorical vs. Categorical Pie Chart 2D – two variables
General Structure
library(ggplot2)
library(webr)
library(dplyr)
table<-table(mydata$Gender, mydata$Pattern)
mydata1<-as.data.frame(table)
mydata1
PieDonut(mydata1,aes(Var1, Var2, count=Freq),
labelposition=1,
title = "Age by intersection",
ratioByGroup = T,
explode = c(1, 2),
explodeDonut=T,
selected=c(1, 2),
r0 = 0,
r1 = .9,
showPieName=FALSE)
##Remember: explode donut does not work properly in complex situation
Practice:
library(ggplot2)
library(webr)
library(dplyr)
M1<-data.frame(table(Pedestrian$Gender, Pedestrian$Pattern))
PieDonut(M1,aes(Var1, Var2, count=Freq), labelposition=1,
title = "Gender vs Pattern of Crossing", ratioByGroup = T, explode = c(1),
explodeDonut=F, r0 = .3, r1=1, r2=1.5, showPieName=FALSE)
Categorical; Categorical vs. Categorical Pie Chart 2D
The 1st declared variable will be placed inside (the pie); the 2nd declared variable forms the outer ring (the donut).
Categorical; Categorical vs. Categorical Bar chart 1D – Single variable
⮕ First you need to create a table describing frequency/ proportion of each category
⮕ Then you need to transfer it into data frame
⮕ You can do without making a table too
table1<-table(data$variable1)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion
Generalized structure:
ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var1))+
geom_bar(stat = "identity", color="black", size=2, width =.5)+
scale_x_discrete(limits = c("Three", "Five", "Four"), labels = c("Three\nGear", "Five\nGear", "Four Gear"))+
scale_y_continuous(breaks = seq(0.1, .6, .1), limits=c(0, 0.5), label = scales::percent)+ #or label = scales::comma
labs(title="Car Characteristics", subtitle="Proportion of Gear",
x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
theme(legend.position = "none")+
coord_flip() #or coord_cartesian()
library(ggplot2)
ggplot(Pedestrian, aes(x=Pattern, fill=Pattern))+geom_bar(stat="count")
table<-table(Pedestrian$Pattern)
df1<- as.data.frame(table) #for frequency
df1
df2<- as.data.frame(prop.table(table)) #for proportion
df2
ggplot(df1, aes(x=Var1, y=Freq))+geom_bar(stat="identity")
1. Grouped (Clustered) Bar Chart: …bars for different sub-categories side-by-side within each category [position = "dodge"]
2. Stacked Bar Chart: …each bar is divided into segments to show sub-categories within each main category [position = "stack", or do not provide any argument; it is the default result of ggplot’s bar command]
3. Segmented/ proportional/ percent stacked Bar Chart: …functions similarly to stacked bars, but is used to show percentages, making each bar reach 100%. [position = "fill"]
Categorical; Categorical vs. Categorical Bar chart 2D – Two variables
Practice:
library(readxl)
Pedestrian <- read_excel("E:/1 Ankhi BUET/Pedestrian.xlsx")
View(Pedestrian)
table1<-table(Pedestrian$Gender,Pedestrian$Age)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion
dataframe
Generalized structure:
ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var2))+
geom_bar(stat = "identity", color="black", size=2, width =.5, position = "dodge")+ #or position = "fill"
scale_x_discrete(limits = c("Three", "Four", "Five"), labels = c("Three Gear", "Four Gear", "Five Gear"))+
scale_y_continuous(breaks = seq(0.1, .6, .1), limits=c(0, 0.5), label = scales::percent)+
labs(title="Car Characteristics",
subtitle="Proportion of Gear", x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota", fill="Engine Type")+
theme_economist()+ #from the ggthemes package
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
coord_flip()+ #or coord_cartesian()
facet_wrap(~Var3, ncol=2, nrow=2)
Categorical; Categorical vs. Categorical Bar chart 4D – Four variables
Generalized structure:
ggplot(dataframe, aes(x=Var1, y=Var2, fill=Freq))+ geom_tile(color= "black")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 6, limits = c(0,12), space = "Lab",
name="Frequency")+
scale_x_discrete(limits = c("Three", "Four", "Five"), labels = c("Three Gear", "Four Gear", "Five Gear"))+
scale_y_discrete(limits = c("Nine", "Ten", "Eleven"), labels = c ("Nine", "Ten", "Eleven"))+
labs(title="Car Characteristics", subtitle="Proportion of Gear", x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota", fill="Frequency")+
theme_economist()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+ facet_wrap(~Var3)
Categorical; Categorical vs. Categorical Tiles Plot
library(readxl)
Pedestrian <- read_excel("E:/1 Faeeza BUET/1. Part Time in BUET/Pedestrian.xlsx")
View(Pedestrian)
library(ggplot2)
table1<-table(Pedestrian$Age, Pedestrian$Intersection)
dataframe1<- as.data.frame(table1) #for frequency
dataframe2<- as.data.frame(prop.table(table1)) #for proportion
dataframe1
ggplot(dataframe1, aes(x=Var1, y=Var2, fill=Freq))+geom_tile(color="black")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 5,
limit = c(0,85), space = "Lab", name="Frequency") + theme(text =
element_text(face="bold.italic", color = "navy", family="Comic Sans MS",
size=12))
library(readxl)
Pedestrian <- read_excel("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/Dataset/Pedestrian.xlsx")
View(Pedestrian)
T1<-as.data.frame(table(Pedestrian$Intersection, Pedestrian$Gender, Pedestrian$Age))
T1
ggplot(T1, aes(x=Var1, y=Var3, fill=Freq))+
geom_tile(color="black")+
scale_fill_gradient2(space = "Lab")+
geom_text(aes(label = round(Freq, 1)))+
facet_wrap(~Var2)
Example:
Day vs. working hour
https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/
Continuous; Categorical vs. Quantitative Histogram
https://www.youtube.com/watch?v=Txlm4ORI4Gs
General Structure:
ggplot(data, aes(x=Var))+ ## optionally also add y=..count..
geom_histogram(color="white", fill="cornflowerblue", bins=30)+ ## or use binwidth=.3 instead of bins
scale_x_continuous(breaks = seq(0, 3, .5), limits=c(0, 3), labels = scales::comma)+
scale_y_continuous(breaks = seq(0, 70, 10), limits=c(0, 70), labels = scales::comma)+
labs(title="Car Characteristics", subtitle="Proportion of Gear", x="Number of Gear",
y="Percentage of cars",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
facet_wrap(~Var1, ncol=1)
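The general structure offers two ways to control the bars: a number of bins or a bin width. The sketch below shows both on made-up data. The data frame speed_df and its values are hypothetical stand-ins for the Pedestrian dataset, and only one of the two arguments should be used in a given geom_histogram() call.

```r
library(ggplot2)

# Hypothetical walking speeds (m/s); stands in for Pedestrian$Speed
set.seed(1)
speed_df <- data.frame(Speed = runif(200, min = 0.5, max = 2.5))

# Option A: fix the NUMBER of bins
p_bins <- ggplot(speed_df, aes(x = Speed)) +
  geom_histogram(bins = 5, color = "white", fill = "cornflowerblue")

# Option B: fix the WIDTH of each bin (in the units of the x variable)
p_width <- ggplot(speed_df, aes(x = Speed)) +
  geom_histogram(binwidth = 0.5, color = "white", fill = "cornflowerblue")

# Inspect what ggplot2 actually computed for each plot
n_bins_a     <- nrow(ggplot_build(p_bins)$data[[1]])
bin_widths_b <- with(ggplot_build(p_width)$data[[1]], xmax - xmin)
```

Printing p_bins or p_width draws the histogram; n_bins_a and bin_widths_b are only there to check the binning.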
General Structure:
library(scales)
ggplot(Pedestrian ,aes(x=Speed, y=..count.. / sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+
scale_y_continuous(breaks = seq(0,.3,.1), limits=c(0, .3), label = scales::comma)+
theme_bw()+
facet_grid(Age~Gender)
General Structure:
library(scales)
ggplot(Pedestrian ,aes(x=Speed, y=..count.. / sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+
scale_y_continuous(breaks = seq(0,.3,.1), limits=c(0, .3), label = scales::percent)+
theme_bw()+
facet_grid(Age~Gender)
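As a hedged aside, recent ggplot2 releases document after_stat() as the replacement for the ..count.. notation used above. The sketch below builds the same kind of proportion histogram on hypothetical data (speed_df is made up, not the Pedestrian file) and checks that the bar heights add up to 1.

```r
library(ggplot2)

# Hypothetical walking speeds (m/s); stands in for Pedestrian$Speed
set.seed(42)
speed_df <- data.frame(Speed = rnorm(300, mean = 1.3, sd = 0.3))

# y = share of all observations falling in each bin
p_prop <- ggplot(speed_df, aes(x = Speed, y = after_stat(count / sum(count)))) +
  geom_histogram(bins = 5, color = "cyan4", fill = "cornflowerblue") +
  scale_y_continuous(labels = scales::percent) +  # show 0.2 as 20%
  labs(y = "Share of observations") +
  theme_bw()

# Heights of the bars as computed by ggplot2
bar_heights <- ggplot_build(p_prop)$data[[1]]$y
```

No limits are set on the y-axis here. With limits such as c(0, .3), a bar taller than the upper limit would be dropped from the plot, so limits should be chosen after looking at the data.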
Percent
Continuous; Categorical vs. Quantitative Histogram: Density 105
y-axis density = relative frequency / width of the bins
General Structure:
library(ggplot2)
library(scales)
ggplot(Pedestrian, aes(x=Speed, y=..density..))+
geom_histogram(color="white", fill= "cornflowerblue",
bins=5)
Need to put the y component (y=..density..) in the aes() function; the scales package is only needed when formatting axis labels (e.g., scales::percent)
Continuous; Categorical vs. Quantitative Histogram: Density 106
Relative frequency on the y-axis: The bin representing 0.7 to 1.5 reaches a height of approximately 0.6 on the y-axis. This means that about 60% of all observations in the dataset have speeds within this range. This high proportion emphasizes that most of the dataset is concentrated around this speed value.
Density on the y-axis: …relative to other speed values, 0.7 to 1.5 has the greatest probability density. A high density here means that if we randomly select an observation, it is most likely to fall within this range.
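The relation density = relative frequency / bin width can be checked by hand. The sketch below uses base R's hist() on hypothetical speeds (the vector speed and the breaks are made up for illustration) and compares the hand-computed density with the one hist() reports.

```r
# Hypothetical walking speeds (m/s); stands in for Pedestrian$Speed
set.seed(7)
speed <- runif(150, min = 0.4, max = 2.4)

# Bin the data without drawing anything
h <- hist(speed, breaks = seq(0, 3, by = 0.5), plot = FALSE)

rel_freq  <- h$counts / sum(h$counts)  # share of observations per bin
bin_width <- diff(h$breaks)            # width of each bin (0.5 here)
dens      <- rel_freq / bin_width      # density = relative frequency / bin width

# Total area of the bars (height x width)
total_area <- sum(dens * bin_width)
```

Because each bar's area equals its relative frequency, the bars of a density histogram always cover a total area of 1, whatever bin width is chosen.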
Continuous; Categorical vs. Quantitative Histogram: 4 types of axis input 107
General Structure:
1D histogram:
geom_vline(aes(xintercept=mean(Speed)), color="blue", linetype="dashed", size=1)
Practice:
ggplot(Pedestrian, aes(x=Speed))+
geom_histogram(color="cyan4", fill="cornflowerblue", bins = 5)+
theme_bw()+
geom_vline(aes(xintercept=mean(Speed)), color="blue", linetype="dashed", size=1)
For a histogram split by a categorical variable (more than 1D), first create a data frame that summarizes the mean value for each category, as follows:
library(plyr)
V_Mean <- ddply(data, "Gender", summarise, grp.mean=mean(Speed))
Then add structure:
geom_vline(data=V_Mean, aes(xintercept=grp.mean, color=Gender), linetype="dashed",size=2)
Practice:
install.packages("plyr")
library(plyr)
V_mean<-ddply(Pedestrian, "Gender", summarise, grp.mean=mean(Speed))
V_mean
ggplot(Pedestrian , aes(x=Speed, y=..count../sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+ theme_bw()+ facet_wrap(~Gender)+
geom_vline(data=V_mean, aes(xintercept=grp.mean, color=Gender), linetype="dashed",size=2)
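If plyr is not available, the same per-group mean table can be sketched with base R's aggregate(). This is an alternative to the ddply() call above, not the method used in these slides. The data frame ped and its values are hypothetical.

```r
# Hypothetical mini dataset; stands in for the Pedestrian data
ped <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Speed  = c(1.0, 1.4, 1.1, 1.3, 1.5)
)

# Mean speed for each gender (one row per category)
V_mean_base <- aggregate(Speed ~ Gender, data = ped, FUN = mean)

# Rename the summary column so it matches the grp.mean name used with geom_vline()
names(V_mean_base)[names(V_mean_base) == "Speed"] <- "grp.mean"
V_mean_base
```

The resulting table can be passed to geom_vline(data=..., aes(xintercept=grp.mean, color=Gender)) in the same way as the ddply() output.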
Continuous; Categorical vs. Quantitative Kernel Density Plot 110
General Structure:
ggplot(data, aes(x=Var1, fill= Var2, color= Var3))+
geom_density(fill="cornflowerblue", color="blue", alpha=1, bw=.1)+
scale_x_continuous(breaks = seq(0, 3, .5), limits=c(0, 3), labels = scales::comma)+
scale_y_continuous(breaks = seq(0, .2, .05), limits=c(0, .2), labels = scales::comma)+
labs(title="Car Characteristics", subtitle="Proportion of Gear", x="Number of Gear",
y="Density",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
facet_wrap(~Var2, ncol=1)
Continuous; Categorical vs. Quantitative Kernel Density Plot 111
…relative to other speed values, 0.5 to 1.5 has the greatest probability density. A high density here means that if we randomly select an observation, it's most likely to fall within this range.
General Structure:
library(ggplot2)
library(scales)
ggplot(Pedestrian, aes(x=Speed, y=..density..))+
geom_histogram(color="white", fill="cornflowerblue", bins=5)+
geom_density()
Histogram with density on the y axis and the density curve both represent the same output
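One way to see why the density histogram and the density curve are comparable is that the area under a kernel density curve is also approximately 1. The sketch below uses base R's density() on hypothetical speeds. The vector speed is made up, and bw = 0.1 simply mirrors the bw value in the general structure.

```r
# Hypothetical walking speeds (m/s); stands in for Pedestrian$Speed
set.seed(11)
speed <- rnorm(200, mean = 1.3, sd = 0.3)

kd        <- density(speed)            # bandwidth chosen automatically
kd_narrow <- density(speed, bw = 0.1)  # fixed bandwidth, as with bw=.1 in geom_density()

# Approximate the area under the curve: sum of heights x grid spacing
area <- sum(kd$y) * diff(kd$x)[1]
```

A smaller bw follows the data more closely and gives a bumpier curve, while a larger bw gives a smoother one. The area under the curve stays close to 1 in both cases.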
library(ggplot2)
ggplot(Pedestrian,aes(x=Speed))+geom_density(fill="cornflowerblue",
color="blue", alpha=.1)+ theme_bw()+
geom_vline(aes(xintercept=mean(Speed)), color="blue",
linetype="dashed", size=1)
library(plyr)
V_mean<-ddply(Pedestrian, "Gender", summarise, grp.mean=mean(Speed))
V_mean
ggplot(Pedestrian, aes(x=Speed, fill = Gender))+
geom_density(alpha=.1)+ ## alpha is a fixed value, so it is set outside aes()
geom_vline(data=V_mean, aes(xintercept=grp.mean, color=Gender), linetype="dashed", size=2)
Continuous; Categorical vs. Quantitative Kernel Density Plot - Add Mean Line 114
How??
Practice:
library(plyr)
V_mean<-ddply(Pedestrian, "Age",
summarise, grp.mean=mean(Speed))
V_mean
ggplot(Pedestrian, aes(x=Speed,
fill=Age))+geom_density(alpha=.1)+
geom_vline(data=V_mean,
aes(xintercept=grp.mean, color=Age),
linetype="dashed",size=2)+
facet_wrap(~Age)+
theme_bw()
Continuous; Categorical vs. Quantitative Histogram and Density Plot Together 115
General Structure:
library(plyr)
V_mean <- ddply(Pedestrian, "Age", summarise, grp.mean = mean(Speed))
V_mean
ggplot(Pedestrian, aes(x = Speed, fill = Age)) +
geom_histogram(aes(y = ..density..), alpha = .2, color = "blue") +
geom_density(alpha = .1) +
geom_vline(data = V_mean, aes(xintercept = grp.mean, color = Age)) +
theme_bw() + facet_wrap(~Age)
Note: Visually not pleasant, misleading to some extent
Continuous; Categorical vs. Quantitative Histogram and Density Plot Together 116
Practice:
library(plyr)
data<-Pedestrian
V_Mean <- ddply(data, "Gender", summarise, mean=mean(Speed))
V_Mean
install.packages("plotly")
library(plotly)
plot<-ggplot(data, aes(x=Speed, fill=Gender))+
geom_histogram(aes(y=..density..), alpha=.7, color="black", bins=40, size=.5)+
geom_density(alpha=.4)+
geom_vline(data=V_Mean, aes(xintercept=mean, color=Gender))+
facet_wrap(~Gender)
ggplotly(plot) ## display the saved ggplot as an interactive plotly graph
Explore – Not for exam 118
library(plyr)
V_Mean <- ddply(data, "Gender", summarise, grp.mean=mean(Speed))
3. Mosaic Plot:
5. Chord diagram:
Explore – Not for exam 121