
R1 Uptovisualisation

The document provides a comprehensive guide on analyzing data using R, covering topics such as downloading R and RStudio, database preparation, variable creation, and data manipulation with the dplyr package. It includes practical examples and functions for creating and managing datasets, as well as exporting data to various formats. The guide emphasizes the importance of data type conversion, recoding, and transforming variables for effective data analysis.


Analyzing Data with R

Plan 298: Data Analytics

Prepared By:
Nur-E-Faeeza Ankhi
Lecturer, DURP, BUET

Acknowledgement:
Niaz Mahmud Zafri
Assistant Professor, DURP, BUET


Topic 1: Introduction and Database Preparation in R

R Download, Install and Opening
Go to the link: https://posit.co/download/rstudio-desktop/
✓ First, download and install the latest version of base R
✓ Then download and install the latest version of RStudio Desktop
(RStudio can be downloaded for free; base R is required to use RStudio)

Now open RStudio and create an R script:
File > New File > R Script
Or, click the “+” sign in the toolbar > R Script
Once the file has opened, save it as follows:
File > Save
Specify a name: the extension .R is added automatically
RStudio can be used as a calculator
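As a minimal sketch of the calculator use mentioned above, arithmetic typed in a script or in the console is evaluated directly. The numbers and object names below are arbitrary illustration values, not part of the course data.

```r
# Basic arithmetic, evaluated line by line
sum_result  <- 7 + 5      # addition
prod_result <- 7 * 5      # multiplication
pow_result  <- 2^3        # exponent
root_result <- sqrt(16)   # square root
print(c(sum_result, prod_result, pow_result, root_result))
```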
RStudio Interface
[Figure: RStudio window with its four panes labelled]
Source: where the script is written; it can also work as a calculator
Console: also shows if there is any error
Environment
Output
Database Preparation

R is case sensitive: Age, age, and aGE are different variables in R.
Variable Creation
Assignment operator: Results of calculations can be stored in objects/variables using the assignment operators:
✓ An arrow (<-) formed by a ‘less than’ character and a hyphen, with no space between them
✓ The equals character (=)

There are some restrictions when giving an object/variable a name:

✓ Object names cannot contain special symbols such as !, +, -, #.
✓ A dot (.) and an underscore (_) are allowed; a name may also start with a dot.
✓ Object names can contain a number but cannot start with a number.
✓ R is case sensitive: X and x are two different objects, as are temp and temP.

Format for creating a variable/object (single observation):

Structure: “Variable name (Var1)” “assignment operator (<-)” “value (7)”
Example of code: Var1 <- 7

Format for creating a variable (vector):

Structure: “Variable name” “assignment operator” “c(values)”
Example of code: Var2 <- c(7, 8, 10, 12, 18, 24)
Variable Creation
Data type
Integer: whole number:- Var2 <- c(7:8, 10L, 12L, 18L, 24L) [the L suffix keeps the values as integers; without it the vector becomes numeric]
Numeric: whole number/decimal:- var3 <- c(7, 8, 10.99, 12.12, 18.908, 24.8)
Character: string:- name <- c("Rahim", "Karim", "Rahman", "Ashik", "Mizan", "Alam")
Factor: data with defined categories:- gender <- factor(c("female", "female", "male", "female", "male", "male"))
Logical: TRUE or FALSE (written without quotes):- var2 <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
The missing value symbol is NA

To attach a detailed label to a variable, use the comment() function
comment(gender) <- "Gender of the respondent"
To view the variable in the console, use the print() function or type its name
print(gender)
gender
To check the data type, use the class() function
class(gender)
To check the number of observations in a variable, use the length() function
length(gender)
To check the levels of a factor variable, use the levels() function
levels(gender)
To view the unique values in a variable, use the unique() function
unique(gender)
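The data types can be checked with class(), as in the following sketch. The object names are hypothetical. It illustrates two common surprises: whole numbers typed without the L suffix are stored as numeric, and T/F values typed inside quotes are character, not logical.

```r
# Whole numbers are stored as "numeric" unless written with the L suffix
x_num <- c(7, 8, 10)
x_int <- c(7L, 8L, 10L)
flags_chr <- c("T", "F")        # quoted: character, not logical
flags_lgl <- c(TRUE, FALSE)     # unquoted: logical
grp <- factor(c("female", "male", "female"))

class(x_num)      # "numeric"
class(x_int)      # "integer"
class(flags_chr)  # "character"
class(flags_lgl)  # "logical"
levels(grp)       # "female" "male"
length(grp)       # 3
```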
Database Preparation
⮕ The data.frame() function is used to create a database (data frame)
⮕ All variables must have the same number of observations to be combined with data.frame()
Structure: “Database name (df)” “assignment operator (<-)” “data.frame(var1, var2, var3)”
Example of code: df <- data.frame(name, gender, Var2, var3)

To check:
Structure of the database, use the str() function: str(df) [shows the number of observations and variables, and the variable types]
Dimensions of the database, use the dim() function: dim(df) [shows (observations, variables)]
To open the database in a spreadsheet-style viewer, use the View() function: View(df)
To print the database in the console, use the print() function or type the database name only: print(df) or df
To check the top n observations, use the head() function: head(df, n=20) [or head(df, 20)]
To check the bottom n observations, use the tail() function: tail(df, n=25)
To access a particular variable in the dataset, use the $ sign: df$name

Do Practice:
Gender | Age | Age_Group | Residence | Zone Type | Education | Occu | Occupation | Mon_per_income
Female | 19 to 30 | Young | Daskhin Khan | Less Sensitive Zone | SSC/HSC | Student | Student | 5,000-19,999
Male | 46 to 65 | Old | Demra | Less Sensitive Zone | Graduation | Business | Employed | 1,00,000 and above
Male | 46 to 65 | Old | Jatrabari | High Sensitive Zone | SSC/HSC | Labor/Daily Job | Employed | 5,000-19,999
Male | 31 to 45 | Middle-aged | Sabujbagh | High Sensitive Zone | SSC/HSC | Private Service | Employed | 5,000-19,999
Female | 19 to 30 | Young | Shyampur | High Sensitive Zone | Graduation | Others | Employed | 5,000-19,999
Female | 19 to 30 | Young | Mohammadpur | High Sensitive Zone | Graduation | Student | Student | 20,000-39,999
Male | 31 to 45 | Middle-aged | Lalbagh | High Sensitive Zone | Graduation | Student | Student | 20,000-39,999
Male | 46 to 65 | Old | Rampura | High Sensitive Zone | More than graduation | Private Service | Employed | 60,000-99,999
Male | 19 to 30 | Young | Rampura | High Sensitive Zone | SSC/HSC | Student | Student | 5,000-19,999
Database Preparation
Do Practice: Solution:
##Create database
#Create ID variable: Character
ID<-c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
ID
class(ID)
length(ID)
#Create Name variable: Character
Name<-c("Firoz", "Rafiq", "Fatema", "Akter", "Nafi", "Sumon", "sadia",
"Sommo", "Sirin", "parvin")
Name
class(Name)
length(Name)
#Create Gender variable: Factor
Gender<-factor(c("Male", "Male", "Female", "Female", "Male",
"Male","Female", "Male", "Female", "Female"))
Gender
class(Gender)
length(Gender)
#Create Age variable: Integer (as.integer() is needed; c() alone gives class "numeric")
Age<-as.integer(c(28, 33, 43, 55, 67, 21, 32, 43, 65, 45))
Age
class(Age)
length(Age)
#Create Weight variable: Numeric
Weight<-c(77.6, 56.7, 60.9, 55.9, 72.6, 68.4, 66.2, 54.7, 75.3, 71.4)
Weight
class(Weight)
length(Weight)
#Create Health Condition variable: Factor
Health_condition<-factor(c("Very Good", "Bad","Moderate","Good",
"Good", "Very Bad", "Bad", "Moderate", "Very Good", "Good"))
Health_condition
class(Health_condition)
length(Health_condition)
#Create database
database1<-data.frame(ID, Name, Gender, Age, Weight, Health_condition)
database1
str(database1)
dim(database1)
head(database1, 5)
tail(database1, 5)
View(database1)
#Create a database by using ID, Age, Weight, Health_condition of "database1" dataframe
database2<-data.frame(database1$ID, database1$Age, database1$Weight, database1$Health_condition)
database2
str(database2)
dim(database2)
head(database2, 5)
tail(database2, 5)
View(database2)

Topic 2: Manipulating data with R

Database Preparation
Spreadsheet-like view
First, create a blank database with the data.frame() function and give it a name
Use the edit() or fix() function to create variables and enter data
If you use the edit() function, you must use the assignment operator to save the result under the database name
These functions can also be used to rename variables
In summary, fix() modifies the original object directly, while edit() returns a modified copy that needs to be reassigned.
Example - R script:
travel_database <- data.frame() #create a blank database
travel_database <- edit(travel_database) #edits are saved because the result is reassigned
travel_database
edit(travel_database) #edits are NOT saved: the result is not reassigned
travel_database
fix(travel_database) #edits are saved directly to travel_database
travel_database

Matrix format
The numeric values in a data frame can alternatively be stored in a matrix with the same dimensions, e.g., 5 rows × 2 columns.
A matrix has no variables to access with $; data are stored and retrieved by row and column number.
Data frame to matrix conversion:
Structure: “New matrix database name” “<-” “as.matrix(data frame name)”
Example: DF_Matrix <- as.matrix(travel_database)
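A minimal sketch of the conversion, assuming a small numeric data frame. The names trips, distance_km, and time_min and all values are hypothetical, chosen only to give a 5 × 2 example.

```r
# Hypothetical 5 x 2 numeric data frame (names and values are illustrative)
trips <- data.frame(distance_km = c(2, 5, 3, 8, 1),
                    time_min    = c(10, 25, 15, 40, 5))
trips_matrix <- as.matrix(trips)   # convert data frame to matrix
dim(trips_matrix)                  # rows, columns
trips_matrix[2, 1]                 # value in row 2, column 1
trips_matrix[, 2]                  # all rows of column 2
```

Elements are addressed as matrix[row, column]; leaving one position empty selects the whole row or column.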
Database: Export/Save
First, set the directory where you want to save the file. This can be done with the setwd() function or manually using the steps:
Session → Set Working Directory → Choose Directory → select the appropriate directory.
To see the current working directory, use the getwd() function. Also see the file.choose() function.

Save as CSV file:

You can save a dataset to CSV using the write_csv() function in the readr package.
Structure: write_csv(name of the database to export, "file name for saving.csv")
Example:
install.packages("readr")
library(readr)
write_csv(travel_database, "travel_database.csv")

Save as Excel file:

Structure: write_xlsx(name of the database to export, "file name for saving.xlsx")
Example:
getwd()
setwd("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/21Oct, 2024") #Notice!! You need to alter the slashes: a path copied from Windows uses \, which must be changed to /
install.packages("writexl")
library(writexl)
write_xlsx(travel_database, "travel_database.xlsx")
Import Excel/SPSS dataset in R
To import an Excel/SPSS dataset, follow this procedure:
Environment → Import Dataset → From Excel/From SPSS → Browse to the Excel/SPSS file in its storage directory → Open →
Code Preview → copy the code from the preview → Cancel → paste the copied code into the R source → Run
Open CSV/Text dataset in R
To import a CSV/text dataset, follow this procedure:
Environment → Import Dataset → From Text (base) → Browse to the CSV/text file in its storage directory → Open →
Heading (Yes) → Import
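The same import can be written as code. The sketch below uses base R only (write.csv() and read.csv()) and a temporary file so that it runs anywhere; the dataset travel and its values are hypothetical. The header = TRUE argument plays roughly the role of the “Heading (Yes)” option in the menu.

```r
# Hypothetical small dataset
travel <- data.frame(id = 1:3, mode = c("bus", "rickshaw", "walk"))

# Export to a temporary CSV file (row.names = FALSE avoids an extra index column)
csv_path <- file.path(tempdir(), "travel_example.csv")
write.csv(travel, csv_path, row.names = FALSE)

# Import it again; header = TRUE treats the first line as variable names
travel_in <- read.csv(csv_path, header = TRUE)
str(travel_in)
```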
Packages and Functions
✓ A package is a collection of functions.
✓ An R package bundles together code, data, documentation, and tests, and is easy to share with others.
✓ One of the reasons that R is so successful is the large variety of packages available.
✓ There is a high chance that someone has already solved a problem similar to the one you are working on, and you can benefit from their work by downloading their package.

Structure:
install.packages("Package name")
library(Package name)
Function()

Example using the “dplyr” package:

install.packages("dplyr")
library(dplyr)
newdata <- select(starwars, name, mass, species)

To get details about a package, use the help tool:- Structure: ?Package name
Example: ?ggplot2

You can find details of the packages on the CRAN webpage

Link: https://cran.r-project.org/
Manipulating data: dplyr package
✓ Data manipulation is the process of arranging a set of data to make it more organized and easier to interpret.
✓ Cleaning your data can be the most time-consuming part of any data analysis.
✓ While there are many approaches, those using the dplyr package are some of the quickest and easiest to learn.
1. Data type conversion
If you need to change the data type of any column, use the following functions:
as.character() converts to a text string
travel_data$car_ownership <- as.character(travel_data$car_ownership)
as.numeric() converts to a number
travel_data$car_ownership <- as.numeric(travel_data$car_ownership)
as.factor() converts to a factor
Pedestrian$Gender <- as.factor(Pedestrian$Gender)

2. Recode
Character Variable
Recode automatically with the as.factor() function (levels are set in alphabetical order)
travel_data$gender <- as.factor(travel_data$gender)
To recode as you need, use the factor() function and supply levels and labels
travel_data$gender <- factor(travel_data$gender, levels = c("Male", "Female"), labels = c("Man", "Woman"))
If the variable is ordinal, use the factor() function and supply levels, labels, and ordered
travel_database$Quality <- factor(travel_database$Quality, levels = c(1, 2, 3),
labels = c("bad", "moderate", "good"), ordered = TRUE)
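A self-contained sketch of the ordinal recode, assuming responses coded 1 to 3. The vector quality_code and its values are hypothetical. With ordered = TRUE the categories carry a ranking, so they can be compared with < and >.

```r
# Hypothetical coded responses: 1 = bad, 2 = moderate, 3 = good
quality_code <- c(3, 1, 2, 3, 2)
quality <- factor(quality_code,
                  levels  = c(1, 2, 3),
                  labels  = c("bad", "moderate", "good"),
                  ordered = TRUE)
quality
levels(quality)            # "bad" "moderate" "good"
quality[1] > quality[2]    # TRUE: "good" ranks above "bad"
table(quality)             # frequency of each category
```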
2. Recode
Numeric Variable
To recode as you need, use the mutate() function of the dplyr package together with the ifelse() conditional function
Structure: “New database name” <- mutate(“existing database name”, “new variable name” = ifelse(condition, value if TRUE, value if FALSE))
#recode-------------------
#Package install
install.packages("dplyr")
library(dplyr)
#create variable-----------------------------------
var1<-c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F", "F", "M")
#Database creation---------------------------------
data<-data.frame(var1, var2)
data
#Recode var1--------------------------------------
#if var1<10 then code it good, otherwise bad
data1<-mutate(data, recode_var1= ifelse(var1<10, "good", "bad"))
data1
#if var1<7 then code it good, var1 is in between 7-14 then moderate, otherwise bad
data2<-mutate(data, recode_var2= ifelse(var1<7, "good", ifelse(var1>=7 & var1<14, "moderate", "bad")))
data2
#Recode var2---------------------------------------
#if var2 is M/F then Human, otherwise animal
data3<-mutate(data, recode_var3= ifelse(var2=="F"|var2=="M", "human", "animal"))
data3
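The three-way recode of var1 can also be sketched in base R without dplyr. Note that cut() is a swapped-in alternative, not the method used in the script above; it is shown because it avoids nesting when there are many categories. The breaks and labels are chosen to reproduce the same rule (below 7 good, 7 up to but not including 14 moderate, otherwise bad).

```r
var1 <- c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)

# Nested ifelse(), same logic as the mutate() example, assigned directly
recode_ifelse <- ifelse(var1 < 7, "good",
                 ifelse(var1 >= 7 & var1 < 14, "moderate", "bad"))

# cut(): with right = FALSE the intervals are [-Inf, 7), [7, 14), [14, Inf)
recode_cut <- cut(var1, breaks = c(-Inf, 7, 14, Inf),
                  labels = c("good", "moderate", "bad"), right = FALSE)

data.frame(var1, recode_ifelse, recode_cut)
```

ifelse() returns a character vector here, while cut() returns a factor whose levels follow the order of the labels.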
3. Transform Variable
The mutate() function of the dplyr package allows you to create new variables or transform existing ones.
#Transform variable------------------------
#Package install
install.packages("dplyr")
library(dplyr)
#Variable creation-------------------------
var0<-1:10
var1<-c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F", "F", "M")
var3<-c("X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X")
var4<-c(1.5, 5.5, 4.9,1.9, 8.5, 3.3, 4.5, 5.9, 8.1, 3.7)
#Dataset preparation-----------------------
data1<-data.frame(var0, var1, var2, var3, var4)
data1
#Create a new variable:var1*5---------------
data2<-mutate(data1,var5=var1*5)
data2
#Create two new variables: var4/5, var1+var4------
data3<-mutate(data1,var6=var4/5, var7=var1+var4)
data3
#Create three variables: log10(var0), var1^2, sqrt(var4)------
data4<-mutate(data1,var8=log10(var0), var9=var1^2, var10=sqrt(var4))
data4
4. Filter Columns
The select() function of the dplyr package allows you to limit your dataset to specified variables (columns).
keep the variables name, height, and gender of the ‘starwars’ dataset
newdata <- select(starwars, name, height, gender)
keep the variable name and all variables between mass and species (inclusive) of the ‘starwars’ dataset
newdata <- select(starwars, name, mass:species)
keep all variables except birth_year and gender of the ‘starwars’ dataset
newdata <- select(starwars, -birth_year, -gender)

5. Filter Row
The filter() function of the dplyr package allows you to limit your dataset to observations (rows) meeting specific criteria.
Multiple criteria can be combined with the & (AND) and | (OR) symbols

-- select only the female observations (in current versions of dplyr, ‘starwars’ records female/male in the sex variable)
newdata <- filter(starwars, sex == "female")
-- select only the female observations living in Dhaka (‘starwars’ has no place variable; mydata stands for any dataset with gender and place variables)
newdata <- filter(mydata, gender == "female" & place == "Dhaka")
-- select only the female observations living in Dhaka or Chittagong
newdata <- filter(mydata, gender == "female" & (place == "Dhaka" | place == "Chittagong"))
4. Filter
#Filter------------------------------------
#Variable creation-------------------------
var0<-1:10
var1<-c(1, 5, 6, 7, 12, 1, 17, 7, 5, 9)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F", "F", "M")
var3<-c("X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X")
var4<-c(1.5, 5.5, 4.9,1.9, 8.5, 3.3, 4.5, 5.9, 8.1, 3.7)
#Dataset preparation-----------------------
data1<-data.frame(var0, var1, var2, var3, var4)
data1
#Filter columns-----------------------------
install.packages("dplyr")
library(dplyr)
# keep the variables var0, var2, and var3
data2<-select(data1, var0, var2, var3)
data2
# keep the variables var0, var2 to var4
data3<-select(data1, var0, var2:var4)
data3
# keep the variables except var2 and var4
data4<-select(data1, -var2, -var4)
data4
#Filter rows--------------------------------
#Keep var3=X and var1<10 observations
data6<-filter(data1,var3=="X"& var1<10)
data6
#Keep var3=X, var1<6 or>11 observations
data7<-filter(data1,var3=="X"&(var1<6|var1>11))
data7
#Keep var3=X, var1<6 or>11, var4<6 observations
data8<-filter(data1,var3=="X"&(var1<6|var1>11)&var4<6)
data8
5. Filter/Delete
To remove a variable from the dataset, assign NULL to it
Structure: dataset$variable name <- NULL
data1$var0 <- NULL
To remove observations from the dataset, use negative row indices:
To remove rows 2, 4, and 6 from the myData dataset:
myData <- myData[-c(2, 4, 6), ]
To remove rows 2 to 6 from the myData1 dataset:
myData1 <- myData1[-c(2:6), ]

To sort by a variable, open the dataset with the View() function, then click the arrow icon next to the variable name
View(data1)
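A runnable sketch of these deletion steps on a small hypothetical dataset (the variables id, age, and zone and their values are illustrative). The last line sorts in code with order(), which is base R's alternative to clicking in the viewer.

```r
# Hypothetical dataset with 6 rows and 3 variables
myData <- data.frame(id   = 1:6,
                     age  = c(21, 34, 45, 29, 52, 38),
                     zone = c("A", "B", "A", "C", "B", "C"))

myData$zone <- NULL                 # remove the variable zone
myData <- myData[-c(2, 4, 6), ]     # remove rows 2, 4, and 6
myData

# Sort by age, largest first; order() returns the row positions in sorted order
myData[order(myData$age, decreasing = TRUE), ]
```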
6. Merge Dataset
Merging datasets is commonly required when data on the same units are stored in multiple tables or datasets.

Consider a simple example where the variables id, year, female, and inc are available in one dataset, and the variables id and maxval in a second.
data3 <- merge(data1, data2, by="id", all=TRUE)

If you want to combine two datasets that have the same variable names (stacking the rows of one under the other), use the rbind() function:
data3 <- rbind(data1, data2)
Practice
1. Print: 1 to 10
ID<-c(1:10)
ID
2. Print: 1 to 5, 78 to 81, 3, -1, 60 to 57
ID<-c(1:5, 78:81, 3, -1, 60:57)
ID
## preparing database------------------------------------------------------------------------------------------------------------------
travel_database <-data.frame()
travel_database <- edit(travel_database)
travel_database
edit(travel_database) ##do not use it if you want to continue with the edited database
fix(travel_database)
##save dataset-------------------------------------------------------------------------------------------------------------
getwd()
setwd("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/21Oct, 2024")
install.packages("writexl")
library(writexl)
write_xlsx(travel_database, "travel_database.xlsx")
##recode----------------------------------------------------------------------------------------------------------------------------------------
class(travel_database$Name)
travel_database$Name<-as.factor(travel_database$Name)
class(travel_database$Name)
travel_database$Name<-as.character(travel_database$Name)
class(travel_database$Age)
travel_database$Age<-as.character(travel_database$Age)
class(travel_database$Age)
travel_database$Age<-as.numeric(travel_database$Age)
class(travel_database$Age)
Practice
##Recoding by factor function--------------------------------------------------------------------------------------------------- -
travel_database$Gender<-factor(travel_database$Gender, levels= c("Male", "Female"), labels=c("Man", "Woman"))
travel_database
travel_database$Vehicle_Quality<-factor(travel_database$Vehicle_Quality, levels=c(0,1,2,3,4,5), labels=c("NA", "very bad", "bad",
"neutral", "good", "very good"), ordered= T)
travel_database
travel_database$Vehicle_Quality<-factor(travel_database$Vehicle_Quality, labels=c(0,1,2,3,4,5), levels=c("NA", "very bad", "bad",
"neutral", "good", "very good"), ordered= T) ##ordered T indicates ranking
travel_database

##new variable by mutate function-------------------------------------------------------------------------------------------------------------------


install.packages("dplyr")
library(dplyr)
class(travel_database$`Number of vehicle`)
travel_database$`Number of vehicle`<-as.numeric(travel_database$`Number of vehicle`)
travel_database<-mutate(travel_database, recode_ownership=ifelse(`Number of vehicle` ==0,"does not own","owns"))
travel_database
travel_database<-mutate(travel_database, recode_quality=ifelse(Vehicle_Quality==0, "NA", ifelse(Vehicle_Quality>=1 &
Vehicle_Quality<=2,"bad",ifelse(Vehicle_Quality==3,"neutral","good"))))
travel_database
## transform by mutate------------------------------------------------------------------------------------------------------------
travel_database<-mutate(travel_database, Var1=Age*2) ##Create a new variable: Age*2
travel_database
travel_database<-mutate(travel_database, var2=Var1/4, var3=Var1+Age) ##Create two variables: Var1/4, Var1+Age
travel_database
travel_database<-mutate(travel_database, var4=log10(Var1), var5=Var1^2, var6=sqrt(Var1))
##Create three variables: log10(Var1), Var1^2, sqrt(Var1)
travel_database
Practice
##filter column by select-------------------------------------------------------------------------------------
reduced_Data <- select(travel_database, -Gender, -c(Var1:var6)) ##keep all variables except Gender and Var1 to var6 (saved under a new name so travel_database stays complete for the steps below)
reduced_Data
new_Data<-select(travel_database, Name, Age, c(recode_ownership:var6))
##keep the variables Name, Age, and recode_ownership to var6 in a new dataset
new_Data
##filter row by filter function
new_Data1<-filter(travel_database, Gender=="Woman" & (`Number of vehicle`==1 | Age>=20))
##select only the women (Gender was relabelled Man/Woman above) who own exactly 1 vehicle or are 20 years old or above
new_Data1
##merge------------------------------------------------------------------------------------------------------------------------- ---------
students <- data.frame( ID = c(1, 2, 3, 4), Name = c("Alice", "Bob", "Charlie", "David"), Age = c(20, 22, 21, 23)) # Sample data 1
scores <- data.frame( ID = c(1, 2, 3, 5), Score = c(85, 90, 88, 76)) # Sample data frame 2
merged_data <- merge(students, scores, by = "ID") ##default merge (only rows with common values are merged)
merged_data
merged_data <- merge(students, scores, by = "ID", all.x = TRUE) ##Keeps all rows from the left data frame (students), adding NA for scores
merged_data
merged_data <- merge(students, scores, by = "ID", all.y = TRUE) ##Keeps all rows from the right data frame (scores), adding NA for students
merged_data
merged_data <- merge(students, scores, by = "ID", all= TRUE) ##Keeps all rows from both data frames, filling in NA for missing matches
merged_data
##merge data sets with same variables-------------------------
df1 <- data.frame( ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"), Age = c(20, 22, 21))
df2 <- data.frame( ID = c(4, 5), Name = c("David", "Eva"), Age = c(23, 24))
combined_data <- rbind(df1, df2)
combined_data

Topic 3: Descriptive Analysis

Descriptive Statistics
✓ Descriptive statistics summarize the main features of a dataset. They provide a way to understand the basic characteristics of data through measures of central tendency, variability (spread), and distribution shape
✓ They do not draw conclusions or make inferences about a population
1. Percentile Values
2. Central Tendency
i. Mean
ii. Median
iii. Mode
iv. Sum
3. Dispersion
i. Standard deviation
ii. Variance
iii. Maximum
iv. Minimum
v. Range
4. Distribution
i. Skewness
ii. Kurtosis
1. Percentile Values
Numerical data can be sorted in increasing or decreasing order, so the values of a numerical dataset have a rank order. A percentile is the value at a particular rank.

The nth percentile of a dataset is the value that cuts off the first n% of the data values when all of the values are sorted from least to greatest (n runs from 1 to 99).
1. Percentile Values

Quartiles
The quartiles divide the data into four divisions of 25% each, so:
Quartile 1 (Q1) can be called the 25th percentile
Quartile 2 (Q2) can be called the 50th percentile
Quartile 3 (Q3) can be called the 75th percentile

Example: Quartile 1 (Q1 = 4) represents the 25th percentile of the data. It means that 25% of the values are less than or equal to 4. This gives an idea of the lower range of the dataset.
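Quartiles can be computed with the base R quantile() function, as in this sketch on hypothetical data (the whole numbers 1 to 9). R's default method interpolates between values, so other software may report slightly different quartiles for the same data.

```r
# Hypothetical data: the whole numbers 1 to 9
x <- 1:9
q <- quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, Q2 (median), Q3
q                                               # 25% = 3, 50% = 5, 75% = 7
mean(x <= q[1])                                 # share of values at or below Q1
```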
2. Central Tendency
Mean
The "mean" is the "average" you're used to: add up all the numbers, then divide by the number of numbers.

Median
The "median" is the "middle" value in the list of numbers. To find the median, the numbers have to be listed in numerical order from smallest to largest, so you may have to rewrite your list before you can find the median.

Mode
The "mode" is the value that occurs most often. If no number in the list is repeated, then there is no mode for the list.

Sum/Total
The sum (total) is the result of adding up all the values.
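These measures can be sketched in R as follows. The values in x are hypothetical. Base R has no built-in function for the statistical mode (its mode() function reports the storage type instead), so stat_mode() below is a hypothetical helper written for this example, not a standard function.

```r
x <- c(2, 4, 4, 5, 7, 9, 4)   # hypothetical values

mean(x)     # sum(x) / length(x)
median(x)   # middle value after sorting
sum(x)      # total

# Hypothetical helper: most frequent value(s); NA when no value repeats
stat_mode <- function(v) {
  counts <- table(v)
  if (max(counts) == 1) return(NA)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)
stat_mode(c(1, 2, 3))   # no repeated value, so no mode
```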
3. Dispersion
Standard deviation: indicates that, on average, the data points deviate from the mean by approximately the value of the standard deviation
✓ A summary measure of the differences of each observation from the mean
✓ A small Std. Dev. means more values near the mean and less spread in the histogram
Variance: square of the Std. Dev.
Maximum: maximum value of the dataset
Minimum: minimum value of the dataset
Range: Maximum - Minimum
[Figure: histogram marking the mean, minimum, maximum, range, the individual deviations from the mean, and the average deviation from the mean (= Std. Dev.)]
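The dispersion measures map onto base R functions as sketched below; the values in x are hypothetical. Note that range() returns the minimum and maximum as a pair, so the range as a single number is computed by subtraction.

```r
x <- c(2, 4, 4, 5, 7, 9, 4)   # hypothetical values

sd(x)              # sample standard deviation (divides by n - 1)
var(x)             # variance, equal to sd(x)^2
max(x)             # maximum
min(x)             # minimum
range(x)           # c(minimum, maximum)
max(x) - min(x)    # range as a single number
```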
4. Distribution
Skewness:
…measures the asymmetry of the data distribution around its mean
✓ If skewness is close to zero (within -1 to +1), the data are roughly symmetric around the mean, as in a normal distribution
✓ Positive skewness indicates a longer tail on the right, with more values concentrated on the left (skewness > 1: right-skewed, positively skewed)
✓ Negative skewness indicates a longer tail on the left, with more values concentrated on the right (skewness < -1: left-skewed, negatively skewed)

Necessity
✓ Many statistical methods (e.g., regression, t-tests) assume that the data is normally distributed. Skewness can
indicate whether this assumption is valid
✓ Identifying outliers
✓ Skewness provides insight into the underlying patterns of the data. For example, if the data represents income
distribution, a positive skew might indicate that a few individuals earn significantly more than the rest
✓ In fields such as finance, environmental studies, and transportation, data can often be skewed. Understanding this
skewness can improve model selection and predictive accuracy
Skewness (video explanation): https://www.youtube.com/watch?v=XSSRrVMOqlQ&t=57s
4. Distribution
Kurtosis:
…measures the tailedness and peak shape of a data distribution. It helps identify whether a dataset has more or fewer extreme values than expected in a normal distribution. The values below refer to excess kurtosis (kurtosis minus 3), for which a normal distribution scores 0.
✓ Kurtosis > 0: Leptokurtic (positive value)
✓ Kurtosis = 0: Normal (mesokurtic)
✓ Kurtosis < 0: Platykurtic (negative value)

Necessity
✓ Many statistical methods assume a normal distribution (mesokurtic). Knowing the kurtosis helps assess whether this assumption is valid
✓ High kurtosis can indicate the presence of extreme outliers, which may need to be addressed in data analysis or modeling.
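Base R has no skewness or kurtosis function, so the sketch below computes both from their moment definitions. The helper names skew_moment() and excess_kurt_moment() and the two example vectors are hypothetical. This is an assumption about the formula: packages may apply small-sample corrections and return slightly different numbers.

```r
# Moment-based skewness: third central moment / (second central moment)^1.5
skew_moment <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: subtracting 3 makes a normal distribution score 0
excess_kurt_moment <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 20)   # one large value stretches the right tail

skew_moment(symmetric)          # 0 for perfectly symmetric data
skew_moment(right_tail)         # positive: longer right tail
excess_kurt_moment(symmetric)   # negative: flatter than a normal distribution
```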
Built-in Functions for Descriptive Statistics
Use built-in functions such as mean(), median(), sd(), var(), min(), max(), range(), sum(), and quantile() to calculate the required descriptive statistics

Case of missing values:

✓ For example, the mean() function returns NA if even one value in a vector is missing.
✓ This can be avoided with the argument na.rm = TRUE, which tells the function to remove any NAs before the calculation.
✓ An example using the mean() function is as follows:
mean(my_data$Sepal.Length, na.rm = TRUE)

Also check fivenum() for finding Min, Q1, Median, Q3, Max

To find a specific quantile, for example the 90th percentile, use:

quantile(data7$var1, 0.90)
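The effect of na.rm can be seen on a small hypothetical vector with one missing value:

```r
x <- c(10, 20, NA, 40)            # hypothetical values, one missing

mean(x)                           # NA, because one value is missing
mean(x, na.rm = TRUE)             # mean of 10, 20, and 40
sum(is.na(x))                     # number of missing values
quantile(x, 0.90, na.rm = TRUE)   # quantile() needs na.rm = TRUE when NAs are present
fivenum(x)                        # fivenum() drops NAs by default
```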
Packages for Descriptive Statistics
Packages used in this course's scripts: pastecs, Hmisc, psych, and summarytools
Descriptive analysis: R Script
#Variable creation-------------------------
var0<-1:20
var1<-c(1:5, 57:55, NA, 6:14, NA, NA)
var2<-c("F", "M", "F", "Z", "M", "M", "Z", "F",
"F", "M", NA, "M", "F", "Z", "M", "M", "Z",
"F", "F", "M")
var3<-c("X", "X", "Y", "Y", "X", "X", "Y", "Y",
"X", "X","X", "X", "Y", "Y", "X", "X", "Y",
"Y", "X", "X")
#Dataset preparation-----------------------
data1<-data.frame(var0, var1, var2, var3)
data1
#Individually Mean, Median, Quantile: 55%, 75%, Five number: var0: no NA-------
mean(data1$var0)
median(data1$var0)
quantile(data1$var0,.55) #55th percentile
quantile(data1$var0,.75)
fivenum(data1$var0) #showing minimum, Q1, Q2, Q3, maximum
#Individually Mean, Median, Quantile: 55%, 75%, Five number: var1: contains NA------
mean(data1$var1) #returns NA
mean(data1$var1, na.rm=TRUE)
median(data1$var1) #returns NA
median(data1$var1, na.rm = TRUE)
quantile(data1$var1,.55) #gives an error because of the NA values
quantile(data1$var1,.55, na.rm = T)
quantile(data1$var1,.75) #gives an error because of the NA values
quantile(data1$var1,.75, na.rm = T)
fivenum(data1$var1) #by default handles NA
#Descriptive statistics using summary()------
summary(data1)
#Descriptive statistics using Hmisc-----------
install.packages("Hmisc")
library(Hmisc)
describe(data1)
#Descriptive statistics using psych-----------
install.packages("psych")
library(psych)
describe(data1)
#Group Descriptive statistics using psych by var2-----------
describeBy(data1,group = data1$var2) #groupwise summary (describeBy() replaces the older describe.by())
#Descriptive statistics using pastecs-----------
install.packages("pastecs")
library(pastecs)
stat.desc(data1)
#Categorical variable frequency: var2 variable------
table1<-table(var2)
table1
#Categorical variable frequency: var2 variable using summarytools package-----------
install.packages("summarytools")
library(summarytools)
freq(data1$var2)
#Categorical variable percentage: var2 variable
prop.table(table1)
#Crosstab (frequency, row%, column%, total%): var2*var3
table2<-table(data1$var2,data1$var3)
table2
prop.table(table2,1)
prop.table(table2,2)
prop.table(table2)
Descriptive analysis: Save the file
getwd()
setwd("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/21Oct, 2024") ## must define location first

install.packages("psych")
library(psych)
describe(Pedestrian) ## you can use other functions too for summary
desc <- describe(Pedestrian)

# Convert to a data frame, extracting relevant details


desc_df <- as.data.frame(desc)
desc_df

# Save as a CSV file--------------------------------------------------------------


write.csv(desc_df, file = "describe_output.csv")
# Save as an Excel file-------------------------------------------------------
install.packages("writexl")
library(writexl)
write_xlsx(desc_df, path = "descriptive_stats.xlsx")
# Run the describe function and save summary output to a text file-------
sink("describe_output.txt") ##creates an empty text file and starts redirecting output to it
describe(Pedestrian)
sink() # Stop redirecting the output (the text file now holds the summary)
# Use Hmisc::describe() to explicitly call the describe function from the Hmisc package (if it gets confused with psych)
desc <- Hmisc::describe(Pedestrian)
desc <- describe(Pedestrian) #without a prefix, R uses describe() from the most recently loaded package
Cross tab
A crosstab, or cross-tabulation, is a statistical tool used to analyze the relationship between two or more categorical variables by creating a matrix (table).

Formulas for computing percentages:
Row percentage = (cell frequency / row total) × 100
Column percentage = (cell frequency / column total) × 100
Total percentage = (cell frequency / grand total) × 100
Cross tab
You can generate frequency tables using the table() function, tables of proportions using the prop.table() function, and marginal frequencies using margin.table().

Suppose we have two variables named A and B. Then the R script will be:
##Create table: A will be rows, B will be columns
mytable <- table(A,B)
##A frequencies (summed over B)
margin.table(mytable, 1)
##B frequencies (summed over A)
margin.table(mytable, 2)
##Total percentages (returned as proportions; multiply by 100 for percent)
prop.table(mytable)
##Row percentage
prop.table(mytable, 1)
##Column percentage
prop.table(mytable, 2)

Practice:
library(readxl)
Pedestrian <- read_excel("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/Pedestrian.xlsx")
View(Pedestrian)

##Create table: Gender will be rows, Pattern will be columns
mytable <- table(Pedestrian$Gender,Pedestrian$Pattern)
mytable
##Gender frequencies (summed over pattern)
margin.table(mytable, 1)
##Pattern frequencies (summed over gender)
margin.table(mytable, 2)
##Total percentages
table1<-prop.table(mytable)
round(table1,2)
##Row percentage
prop.table(mytable, 1)
##Column percentage
prop.table(mytable, 2)
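The same steps can be tried without the course file. The sketch below is only an illustration: it assumes the built-in mtcars data set as a stand-in for Pedestrian, with cyl and gear treated as the two categorical variables, and it multiplies the proportions by 100 to get percentages.

```r
# Cross-tabulate two categorical variables from the built-in mtcars data
# (assumption: mtcars stands in for the course data set)
mytable <- table(mtcars$cyl, mtcars$gear)   # cyl = rows, gear = columns

row_totals <- margin.table(mytable, 1)      # cyl frequencies (summed over gear)
col_totals <- margin.table(mytable, 2)      # gear frequencies (summed over cyl)

total_pct <- round(prop.table(mytable) * 100, 1)     # share of the grand total
row_pct   <- round(prop.table(mytable, 1) * 100, 1)  # each row adds up to about 100
col_pct   <- round(prop.table(mytable, 2) * 100, 1)  # each column adds up to about 100

print(mytable)
print(row_pct)
print(col_pct)
```

Reading row_pct answers "of the cars with a given number of cylinders, what share has each gear count?", while col_pct answers the reverse question.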
Topic 4: Data Visualization with R


Check:
https://rkabacoff.github.io/datavis/
https://bookdown.org/content/b298e479-b1ab-49fa-b83d-a57c2b034d49/distributions.html

Data Visualization: Basic Grammar
Data Visualization with R
“The simple graph has brought more information to the data analyst’s mind than any other device.” (John Tukey)
This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.

1. Univariate graphs:
…plot the distribution of data from a single variable. The variable can be categorical (e.g., race, sex) or quantitative (e.g., age, weight).
   1. Categorical
      Pie chart
      Bar chart
   2. Continuous
      Histogram
      Density plot
      Box plot

2. Bivariate graphs:
…display the relationship between two variables. The type of graph will depend on the measurement level of the variables (categorical or quantitative).
   1. Categorical vs. Categorical
      Pie-Donut Chart
      Bar chart (Stacked, grouped, segmented)
      Tiles plot
   2. Quantitative vs. Quantitative
      Scatter plot
      Line plot
   3. Categorical vs. Quantitative
      Grouped density plot
      Grouped histogram
      Box plot

3. Multivariate graphs:
…display the relationships among three or more variables. There are two common methods for accommodating multiple variables: grouping and faceting.
Graph type
Different functions used for different graph type:

Structure of geom function (geom_bar() taken as example):


geom_bar(color = "blue", fill = "red", alpha = 1, size = 1, shape = 1, stroke = 1, width = 1, linetype = 1)

✓ To create a 2D graph, enter these elements as fixed values inside the geom function.
✓ If you do not enter a value for an element, the system uses its default value.
✓ For a 3D/4D/5D graph, place the elements by which you differentiate the data inside the aes() function.
✓ The other elements remain within the geom function.
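A short sketch may make the difference between a fixed value and a mapped variable concrete. It assumes the mpg data set that ships with ggplot2 (not the course data); displ, hwy and class are columns of that data set.

```r
library(ggplot2)

# Fixed value: every point is blue, so the graph stays 2D (x and y only).
# The color NAME goes inside the geom function.
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", size = 2, alpha = 0.6)

# Mapped value: color now depends on the variable 'class' (a third dimension).
# The VARIABLE name goes inside aes(); size and alpha stay fixed in the geom.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2, alpha = 0.6)

print(p_fixed)
print(p_mapped)
```

Only the second plot gets a legend, because a legend is drawn only for elements that are mapped to a variable.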
ggplot: Define Data
Use the ggplot() function to define the data and the aes() function to build up the structure of the graph.
Structure of ggplot() function: ggplot(data, aes())

Structure of aes() function:
aes(x = var1, y = var2, color = varX, fill = varX, alpha = varX, size = varX, shape = varX, stroke = varX, width = varX, linetype = varX)

You only need to enter the x and y values to create a 2D graph. To add a 3D, 4D, or 5D view, include the other elements.
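ggplot: Define Data
[Figure: annotated example plot. The x axis is defined by aes(x = Cylinders) and the y axis by aes(y = Miles per Gallon (mpg)).]
alpha = transparency of the fill
color = color of the outline (border)
fill = color of the interior
stroke = thickness of the border of a point

aes(): Define Aesthetics – Color/ fill
aes(): Define Aesthetics – Shape
aes(): Define Aesthetics – Linetypes
[Figures: reference charts of the available colors, point shapes and line types]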
Graph: X and Y Axis
To reorganize the values or labels that appear on the x- and y-axis, use the following functions:
1. For a numeric variable
scale_x_continuous()
scale_y_continuous()
Structure:
scale_x_continuous(breaks = seq(1, 7, 1), limits = c(1, 7), labels = scales::percent)

2. For a categorical variable
scale_x_discrete()
scale_y_discrete()
Structure:
scale_x_discrete(limits = c("pickup", "suv", "minivan", "midsize", "compact", "subcompact", "2seater"),
labels = c("Pickup\nTruck", "Sport Utility\nVehicle", "Minivan", "Mid-size", "Compact", "Subcompact", "2-Seater"))
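As a minimal sketch of both scale types on one graph, the example below assumes the mpg data set from ggplot2; the category names passed to limits are the actual values of its class column, and the break values are an illustrative choice.

```r
library(ggplot2)

# Number of cars per class, with a custom category order and relabelled ticks
p_axis <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  scale_x_discrete(
    limits = c("pickup", "suv", "minivan", "midsize",
               "compact", "subcompact", "2seater"),      # order of categories
    labels = c("Pickup\nTruck", "Sport Utility\nVehicle", "Minivan",
               "Mid-size", "Compact", "Subcompact", "2-Seater")
  ) +
  scale_y_continuous(breaks = seq(0, 70, 10), limits = c(0, 70))  # numeric axis

print(p_axis)
```

limits controls which categories appear and in what order, while labels only changes the text shown; the two vectors must be given in the same order.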
Add labels: Title, x-axis title and so on
To add labels, use the labs() function.
Structure:
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class", shape = "Year")
Change Background: Theme
To change the background, add one of the following theme functions:

theme_gray()
theme_bw()
theme_linedraw()
theme_light()
theme_dark()
theme_minimal()
theme_classic()
theme_void()
theme_test()

If you do not need any advanced modification, you do not need to write anything inside the theme function.
[Figure: example plot using theme_bw()]
Change Background: Theme
We can also get a wide variety of themes by installing the “ggthemes” package.
Change Background: Theme – Customized one
You can customize the theme as you wish using the theme() function:
⮕ Change label text from black to navy blue, font to Comic Sans MS and size to 16
⮕ Change the panel background color from grey to white
⮕ Add solid grey lines for major y-axis grid lines
⮕ Add dashed grey lines for minor y-axis grid lines
⮕ Eliminate x-axis grid lines
⮕ Change the strip background color to white with a grey border

Structure:
theme(text = element_text(color = "navy", family="Comic Sans MS", size=16),
panel.background = element_rect(fill = "white"),
panel.grid.major.y = element_line(color = "grey"),
panel.grid.minor.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
strip.background = element_rect(fill = "white", color="grey"))
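Labels, a ready-made theme and custom theme() settings are usually combined in one call. The sketch below assumes the mpg data set from ggplot2 and leaves out the font family so that it runs on machines without Comic Sans MS.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Mileage by engine displacement",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       color = "Car Class") +
  theme_bw() +                                   # ready-made theme first
  theme(text = element_text(color = "navy", size = 16),
        panel.grid.major.x = element_blank(),    # eliminate x-axis grid lines
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_line(color = "grey",
                                          linetype = "dashed"))

print(p_theme)
```

The order matters here: a ready-made theme such as theme_bw() resets every setting, so custom theme() changes should be added after it, not before.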
X and Y Axis: Theme
To change the text style and alignment of the x- and y-axis labels, use the following function and structure:

For x-axis:
theme(axis.text.x = element_text(face="bold", angle = 0, color= "black", hjust = 1))

For y-axis:
theme(axis.text.y = element_text(face="bold.italic", angle = 90, size= 12, color= "black", hjust = 1))

Here:
✓ face = font style (plain/bold/italic/bold.italic)
✓ angle = rotation of the text in degrees
✓ size = font size
✓ color = text color
✓ hjust = horizontal justification of the text (0 = left, 0.5 = center, 1 = right)
Best fit line: geom_smooth()
To add a best-fit line, use the function geom_smooth().
You have to define the method by which the line is fitted.

⮕ Linear model: geom_smooth(method="lm")
⮕ Generalized additive model (GAM): geom_smooth(method="gam")
⮕ Generalized linear model (GLM): geom_smooth(method="glm")
⮕ Locally weighted smoothing: geom_smooth(method="loess")

[Figure: four example plots. 1. lm; 2. gam; 3. glm; 4. loess]
Add best fit line: geom_smooth()
If you do not want to keep the shaded confidence interval around the smooth line, add se = FALSE:
geom_smooth(method="lm", se=FALSE)

[Figure: the same plot drawn with se = FALSE]

The shaded area represents the confidence interval, indicating the range within which we are fairly certain the true
relationship lies. It shows the level of uncertainty in the prediction; a wider area indicates more uncertainty.
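A minimal sketch of two of these methods, assuming the built-in mtcars data set (wt = weight, mpg = miles per gallon) rather than the course data:

```r
library(ggplot2)

# Straight best-fit line (linear model) without the confidence band
p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Locally weighted smoothing, keeping the shaded confidence band
p_loess <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess", se = TRUE)

print(p_lm)
print(p_loess)
```

Writing se = FALSE in full is safer than se = F, because F is an ordinary variable name that can be overwritten.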
Grouping: facet_wrap()
In faceting, a graph consists of several separate plots, one for each level of a variable or combination
of variables.

Structure (to create multiple plots based on var1):
facet_wrap(~var1)

You can control the number of rows and columns with the nrow and ncol arguments.
facet_wrap(~var1, ncol=2, nrow=2)

To create multiple plots based on var1 and var2, use:
facet_grid(var1~var2)
Grouping: facet_wrap()
[Figure: three example plots]
1. facet_wrap(~carb)
2. facet_wrap(~carb, ncol=2, nrow=3)
3. facet_grid(vs~carb)
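The three facet calls can be reproduced with a short sketch. It assumes the built-in mtcars data set, whose carb and vs columns match the variable names in the captions; the scatter plot underneath is an illustrative choice.

```r
library(ggplot2)

# Base plot that is reused for every facet layout
p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

p_wrap   <- p_base + facet_wrap(~carb)                      # one panel per carb value
p_wrap23 <- p_base + facet_wrap(~carb, ncol = 2, nrow = 3)  # same panels in a 2 x 3 layout
p_grid   <- p_base + facet_grid(vs ~ carb)                  # rows = vs, columns = carb

print(p_wrap23)
print(p_grid)
```

When both ncol and nrow are given, their product must be at least the number of panels; carb has six distinct values in mtcars, so 2 x 3 fits exactly.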
Plot Type - 1D, 2D, 3D, 4D…..

To create a 2D graph, enter fixed values for elements such as color, fill and size inside the geom function, e.g., geom_point().

If you do not enter a value for an element, the system uses its default value.
For a 3D/4D/5D graph, place the elements by which you differentiate the data inside the aes() function.
The other elements remain within the geom function.

You only need to enter the x and y values in the aes() function to create a 2D graph.
To add a 3D, 4D, or 5D view, include other elements, e.g., color, fill, size.

Do not provide a color name or size value inside aes().
Write the name of the variable that turns the graph from 2D to 3D.

Using the facet_wrap() function, you can also add further dimensions to a graph.
Plot - 1D, 2D, 3D, 4D…..
[Figure: four example plots. 1. 2D; 2. 3D; 3. 4D; 4. 6D]

Data Visualization: Plotting Graph


Quantitative vs. Quantitative Scatter and Line Plot

✓ Express the relation between two continuous variables
✓ X axis = continuous variable, Y axis = continuous variable

Scatter plot
Basic: ggplot(data, aes(x=Var1, y=Var2))+geom_point()
With other changes:
ggplot(data, aes(x= var1, y= var2, color= varX, alpha= varX, size= varX, shape= varX, stroke= varX))+
geom_point(color= "blue", alpha= 1, size= 1, shape= 1, stroke= 1)+
scale_x_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), labels = scales::comma)+
scale_y_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), labels = scales::percent)+
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class",
shape = "Year")+
theme_gray()+
theme(axis.text.x = element_text(face="bold.italic", angle = 0, size=12, color= "black", hjust = 1))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, size= 12, color= "black", hjust = 1))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
geom_smooth(method="loess", se=T)+
facet_wrap(~varX, ncol=2, nrow=2)
##Note: keep the "+" at the end of each line; a line that starts with "+" breaks the command

Scattered plot - Example


library(readxl) ##calling data set
world_happiness_report_data <- read_excel("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/Dataset/world-happiness-
report_data.xlsx")
View(world_happiness_report_data)

D<-world_happiness_report_data ##changed dataset name


D

install.packages("ggplot2")
library(ggplot2)
ggplot(D, aes(x=HappinesIndex, y=GDP_index, color=continent, shape = Region))+
geom_point()+ scale_x_continuous(breaks = seq(2, 8, 2), limits=c(2, 8), label = scales::comma)+
scale_y_continuous(breaks = seq(6, 11, 1), limits=c(6, 11))+
labs(title="GDP vs Happiness", subtitle= "a global study",
caption="Source: author, 2024",
x=" Index value of Happiness",
y= "Index value of GDP", color= "Names of Continent", shape= "Region")+ theme_bw()+
theme(axis.text.x = element_text(face="bold.italic", angle = 45))+
theme(axis.text.y = element_text(face="bold.italic", angle = 30, size= 18, color= "blue", hjust = .0001))+
geom_smooth(method="loess", se=F)+
facet_wrap(~continent, ncol=2, nrow=3)

ggplot(D, aes(x=HappinesIndex, y=GDP_index))+
geom_point()+geom_smooth(method="loess", se=T)+
facet_grid(YearDummy~Region)

[Figure: output of the scatter plot example]


Line plot
Basic: ggplot(data, aes(x= var1, y= var2))+ geom_line()

Extended structure:
ggplot(data, aes(x= var1, y= var2, color= varX, alpha= varX, size= varX, shape= varX, linetype= varX))+
geom_line(color= "blue", alpha= 1, size= 1, shape= 1, linetype= 1)+
scale_x_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::comma)+
scale_y_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), label = scales::comma)+
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class",
shape = "Year")+
theme_gray()+
theme(axis.text.x = element_text(face="bold.italic", angle = 0, size=12, color= "black", hjust = 1))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, size= 12, color= "black", hjust = 1))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
geom_smooth(method="loess", se=T)+
facet_wrap(~varX, ncol=2, nrow=2)

Line plot - Example


library(readxl) ##calling data set
world_happiness_report_data <- read_excel("E:/1 Ankhi BUET/1R/Dataset/world-happiness-report_data.xlsx")
View(world_happiness_report_data)

D<-world_happiness_report_data ##changed dataset name


D

install.packages("ggplot2")
library(ggplot2)
ggplot(D, aes(x=year, y=HappinesIndex, color=continent, linetype = Region))+
geom_line()+
scale_x_continuous(breaks = seq(2005, 2020, 5), limits=c(2005, 2020))+
scale_y_continuous(breaks = seq(2,8,2), limits=c(2, 8))+
labs(title="Happiness over time", subtitle= "a global study",
caption="Source: author, 2024",
x="Year", y= "Index value of Happiness",
color= "Names of Continent", linetype= "Region")+
theme_bw()+
geom_smooth(method="loess", se=T)+
facet_wrap(~continent, ncol=2, nrow=3)

ggplot(D, aes(x=year, y=Life_expectancy))+
geom_line(linetype=3, color="cyan4")+
geom_smooth(method="loess", se=T)+
facet_wrap(~Region)

[Figure: output of the line plot example]


Scattered and Line plot together


Generalized structure:
ggplot(data, aes(x= var1, y= var2, color= varX, alpha= varX, size= varX, shape= varX, stroke=varX, linetype= varX))+
geom_line(color= "blue", alpha= 1, size= 1, linetype= 1)+
geom_point(color= "red", alpha= 1, size= 1, shape= 1, stroke= 1)+
scale_x_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), labels = scales::comma)+
scale_y_continuous(breaks = seq(1, 7, 1), limits=c(1, 7), labels = scales::comma)+
labs(title = "Mileage by engine displacement",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
x = "Engine displacement (litres)",
y = "Highway miles per gallon",
color = "Car Class",
shape = "Year")+
theme_gray()+
theme(axis.text.x = element_text(face="bold.italic", angle = 0, size=12, color= "black", hjust = 1))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, size= 12, color= "black", hjust = 1))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
geom_smooth(method="loess", se=T)+
facet_wrap(~varX, ncol=2, nrow=2)
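A minimal runnable version of the combined structure, assuming the built-in pressure data set (temperature in degrees Celsius against vapour pressure of mercury in mm Hg) in place of the course data:

```r
library(ggplot2)

# Line and points drawn from the same x/y mapping;
# each layer gets its own fixed color inside its geom function
p_both <- ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line(color = "blue", linetype = 1) +
  geom_point(color = "red", size = 2) +
  labs(title = "Scatter and line plot together",
       x = "Temperature (deg C)",
       y = "Pressure (mm Hg)") +
  theme_bw()

print(p_both)
```

Layers are drawn in the order they are added, so placing geom_point() after geom_line() keeps the points visible on top of the line.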

Draw a graph!!
⮕ Use Bangladesh data only
⮕ Show the relationship between GDP and corruption (point plot only)
⮕ Show year-wise GDP of Bangladesh (plot both scatter and line plots)
⮕ Draw the best-fit line
library(dplyr)
library(ggplot2)
D1<-filter(D, Country=="Bangladesh")
D1
##GDP vs corruption: use column names inside aes(), not D1$...
ggplot(D1, aes(x=GDP_index, y=Corruption))+ geom_point()

##Year-wise GDP with points, line and best-fit line
ggplot(D1, aes(x=year, y=GDP_index))+
geom_line()+geom_point()+
geom_smooth(method="lm", se=F)
Categorical; Categorical vs. Categorical Pie Chart 1D – single variable

✓ A pie chart, sometimes called a circle chart, summarizes a set of nominal/categorical data or
displays the different values of a given variable (e.g. percentage distribution).
General Structure:
library(ggplot2)
library(webr)
library(dplyr)
mydata2<-as.data.frame(table(mydata$Variable))
mydata2
PieDonut(mydata2,aes(Var1, count=Freq),
labelposition=0,
explode = c(2, 5),
r0 = 0,
showPieName=F)

Practice:
library(ggplot2)
library(webr)
library(dplyr)
table<-table(Pedestrian$Intersection)
table
mydata2<-as.data.frame(table(Pedestrian$Intersection))
mydata2
##or
m1<-data.frame(table(Pedestrian$Intersection))
##Break the long label over two lines
mydata2$Var1 <- gsub("Shapla Chattar", "Shapla\nChattar", mydata2$Var1)
PieDonut(mydata2, aes(Var1, count=Freq), labelposition=0, explode = c(2, 5,6), r0 = .5, showPieName=TRUE)
Categorical; Categorical vs. Categorical Pie Chart 2D – two variables

Shows the proportion distribution of two categorical variables.

✓ You need three packages: ggplot2, dplyr, webr
✓ The Pie-Donut chart is built using the function PieDonut() from the webr package.
✓ The two categorical variables are defined in the aesthetics aes() [ggplot2].
✓ To place the donut labels inside, set the labelposition argument to 0.
✓ To place all labels outside, set the labelposition argument to 1.
✓ ratioByGroup: TRUE means the donut percentages within each pie slice add up to 100%; FALSE means they add up to
the percentage of the pie slice.
✓ You can pull out one or more pie slices using “explode” with the category numbers,
which are assigned in alphabetical order.
✓ You can also pull out one or more donut segments using explodeDonut=T and choose them
by donut number in selected=c(1,2,3). If you do not use the selected argument, all the
donut segments of the exploded pie slice are pulled out.
✓ Finally, the radius of the pie and the donut is controlled with the arguments r0, r1 and r2.
If not defined, the values are r0 = 0.3, r1 = 1, r2 = 1.2.
Categorical; Categorical vs. Categorical Pie Chart 2D

General Structure
library(ggplot2)
library(webr)
library(dplyr)

table<-table(mydata$Gender, mydata$Pattern)
mydata1<-as.data.frame(table)
mydata1
PieDonut(mydata1,aes(Var1, Var2, count=Freq),
labelposition=1,
title = "Age by intersection",
ratioByGroup = T,
explode = c(1, 2),
explodeDonut=T,
selected=c(1, 2),
r0 = 0,
r1 = .9,
showPieName=FALSE)
##Remember: explodeDonut does not work properly in complex situations

Practice:
library(ggplot2)
library(webr)
library(dplyr)
M1<-data.frame(table(Pedestrian$Gender, Pedestrian$Pattern))
PieDonut(M1,aes(Var1, Var2, count=Freq), labelposition=1,
title = "Gender vs Pattern of Crossing", ratioByGroup = T, explode = c(1),
explodeDonut=F, r0 = .3, r1=1, r2=1.5, showPieName=FALSE)

Note: the first declared variable is placed inside (the pie).
Categorical; Categorical vs. Categorical Bar chart 1D – Single variable

⮕ First you need to create a table describing the frequency/proportion of each category
⮕ Then you need to convert it into a data frame
⮕ You can also do it without making a table
table1<-table(data$variable1)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion
Generalized structure:
ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var1))+
geom_bar(stat = "identity", color="black", size=2, width =.5)+
scale_x_discrete(limits = c("Three", "Five", "Four"), labels = c("Three\nGear", "Five\nGear", "Four Gear"))+
scale_y_continuous(breaks = seq(0.1, .6, .1), limits=c(0, 0.5), labels = scales::percent)+ #or labels = scales::comma
labs(title="Car Characteristics", subtitle="Proportion of Gear",
x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
theme(legend.position = "none")+
coord_flip() #or coord_cartesian() to keep the bars vertical
Categorical; Categorical vs. Categorical Bar chart 1D – Single variable

Practice:
library(ggplot2)
ggplot(Pedestrian, aes(x=Pattern, fill=Pattern))+geom_bar(stat="count")

table<-table(Pedestrian$Pattern)
df1<- as.data.frame(table) #for frequency
df1
df2<- as.data.frame(prop.table(table)) #for proportion
df2
ggplot(df1, aes(x=Var1, y=Freq))+geom_bar(stat="identity")

ggplot(df2, aes(x=Var1, y=Freq, fill=Var1))+
geom_bar(stat="identity", color="cyan4", size=2, width = .62)+
scale_y_continuous(breaks = seq(0, .8, 0.1), limits = c(0,.8), labels = scales::percent)+
scale_x_discrete(limits = c("Running", "Walking"), labels = c("fast\nrun", "walk"))+
labs(title="Intersection Crossing Pattern", x="pattern of crossing", y="Percentage of respondents")+
theme_linedraw()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "blue", hjust = .51))+
theme(text = element_text(color = "cyan4", family="Comic Sans MS", size=14))+
theme(legend.position = "none")+
coord_flip() #or coord_cartesian() to keep the bars vertical
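The table-to-data-frame route can be checked on data that ships with R. This sketch assumes the built-in mtcars data set, with the number of gears as the single categorical variable; Var1 and Freq are the column names that as.data.frame() gives to a one-way table.

```r
library(ggplot2)

tab1    <- table(mtcars$gear)                  # frequency of each category
df_freq <- as.data.frame(tab1)                 # columns: Var1, Freq (counts)
df_prop <- as.data.frame(prop.table(tab1))     # columns: Var1, Freq (proportions)

p_bar <- ggplot(df_prop, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", color = "black", width = 0.5) +
  scale_y_continuous(breaks = seq(0, 0.5, 0.1), limits = c(0, 0.5),
                     labels = scales::percent) +
  labs(x = "Number of gears", y = "Percentage of cars") +
  theme(legend.position = "none") +            # legend repeats the x axis, so hide it
  coord_flip()                                 # horizontal bars

print(p_bar)
```

stat = "identity" tells geom_bar() to use the Freq column as the bar height instead of counting rows itself.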
Categorical; Categorical vs. Categorical Bar chart 2D – Two variables

⮕ Express two categorical variables.
⮕ First you need to create a table describing the frequency/proportion of each category.
⮕ Then you need to convert it into a data frame.
table1<-table(data$variable1, data$variable2)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion
Generalized structure:
library(ggthemes) #needed for theme_economist()
ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var2))+
geom_bar(stat = "identity", color="black", size=2, width =.5, position = "dodge")+ #or position = "fill"
scale_x_discrete(limits = c("Three", "Four", "Five"), labels = c("Three Gear", "Four Gear", "Five Gear"))+
scale_y_continuous(breaks = seq(0.1, .6, .1), limits=c(0, 0.5), labels = scales::percent)+
labs(title="Car Characteristics", subtitle="Proportion of Gear",
x="Number of Gear", y="Percentage of cars", caption="Source: Car sale data of Toyota",
fill="Engine Type")+
theme_economist()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
coord_flip() #or coord_cartesian() to keep the bars vertical
Categorical; Categorical vs. Categorical Bar chart

1. Grouped (Clustered) Bar Chart: bars for different sub-categories are placed side-by-side within each category [position = "dodge"]
2. Stacked Bar Chart: each bar is divided into segments to show sub-categories within each main category [position = "stack",
or do not provide any argument; it is the default result of ggplot’s bar command]
3. Segmented/ proportional/ percent stacked Bar Chart: works like stacked bars, but shows
percentages, so each bar reaches 100%. [position = "fill"]
[Figure: three example bar charts, numbered 1 to 3 as above]
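The three position options can be compared side by side with a small sketch. It assumes the built-in mtcars data set, with gear and am (transmission type) converted to factors so that ggplot2 treats them as categorical; the object name cars_df is illustrative.

```r
library(ggplot2)

# gear and am are stored as numbers in mtcars, so convert them to categories
cars_df <- transform(mtcars, gear = factor(gear), am = factor(am))

# 1. Grouped: sub-categories side by side
p_dodge <- ggplot(cars_df, aes(x = gear, fill = am)) +
  geom_bar(position = "dodge")

# 2. Stacked: sub-categories on top of each other (the default)
p_stack <- ggplot(cars_df, aes(x = gear, fill = am)) +
  geom_bar(position = "stack")

# 3. Segmented: every bar rescaled to 100%
p_fill <- ggplot(cars_df, aes(x = gear, fill = am)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent)

print(p_dodge)
print(p_stack)
print(p_fill)
```

Here geom_bar() counts the rows itself (stat = "count" is its default), so no table or data frame has to be built first.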
Categorical; Categorical vs. Categorical Bar chart 2D – Two variables

Practice:
library(readxl)
Pedestrian <- read_excel("E:/1 Ankhi BUET/Pedestrian.xlsx")
View(Pedestrian)

table1<-table(Pedestrian$Gender,Pedestrian$Age)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion
dataframe

ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var2))+


geom_bar(stat = "identity", color="black", size=1,
width =.7, position = "dodge")+
scale_y_continuous(breaks = seq(0.1, .6, .1),
limits=c(0, 0.5), label = scales::percent)+
labs(title="Gender and Age group",
x="Gender", y="Proportion", fill="Age group")+
theme_minimal()+
theme(axis.text.x = element_text(face="bold",
angle = 90, color= "Red", hjust = 0))+
theme(axis.text.y = element_text(face="bold.italic",
angle = 90, color= "Red", hjust = 0))+
theme(text = element_text(color = "navy",
family="Comic Sans MS", size=14))+
coord_flip()
Categorical; Categorical vs. Categorical Bar chart 3D – Three variables

Express three categorical variables.
First you need to create a table describing the frequency/proportion of each category.
Then you need to convert it into a data frame.
table1<-table(data$variable1, data$variable2, data$variable3)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion

Generalized structure:
library(ggthemes) #needed for theme_economist()
ggplot(dataframe, aes(x=Var1, y=Freq, fill=Var2))+
geom_bar(stat = "identity", color="black", size=2, width =.5, position = "dodge")+ #or position = "fill"
scale_x_discrete(limits = c("Three", "Four", "Five"), labels = c("Three Gear", "Four Gear", "Five Gear"))+
scale_y_continuous(breaks = seq(0.1, .6, .1), limits=c(0, 0.5), labels = scales::percent)+
labs(title="Car Characteristics",
subtitle="Proportion of Gear", x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota", fill="Engine Type")+
theme_economist()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
coord_flip()+ #or coord_cartesian() to keep the bars vertical
facet_wrap(~Var3, ncol=2, nrow=2)
Categorical; Categorical vs. Categorical Bar chart 4D – Four variables

ggplot(Pedestrian, aes(x=Gender, fill=Age))+
geom_bar(stat = "count", position = "fill")+
labs(title="Gender - Age group when 'position = fill' ",
x="Gender", y="Percentage of respondents",
fill="Age group")+
scale_y_continuous(breaks = seq(0.1, 1, .1),
limits=c(0, 1), labels = scales::percent)+
facet_grid(Pattern~Group) ##use column names here, not Pedestrian$...

Notice: you can plot such graphs either directly using variables from the original dataset or by creating
a table/data frame based on the original data.
Practice: a bar plot can be converted to 4D through the facet_grid() function.
[Figures: output of the 3D and 4D bar chart practice]

Categorical; Categorical vs. Categorical Tiles Plot

⮕ Express two categorical variables.
⮕ First you need to create a table describing the frequency/proportion of each category.
⮕ Then you need to convert it into a data frame.
⮕ Also called a heatmap: the color intensity of each tile shows its value.
table1<-table(data$variable1, data$variable2)
dataframe<- as.data.frame(table1) #for frequency
dataframe<- as.data.frame(prop.table(table1)) #for proportion

Generalized structure:
library(ggthemes) #needed for theme_economist()
ggplot(dataframe, aes(x=Var1, y=Var2, fill=Freq))+ geom_tile(color= "black")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 6, limit = c(0,12), space = "Lab",
name="Frequency")+
scale_x_discrete(limits = c("Three", "Four", "Five"), labels = c("Three Gear", "Four Gear", "Five Gear"))+
scale_y_discrete(limits = c("Nine", "Ten", "Eleven"), labels = c ("Nine", "Ten", "Eleven"))+
labs(title="Car Characteristics", subtitle="Proportion of Gear", x="Number of Gear", y="Percentage of cars",
caption="Source: Car sale data of Toyota", fill="Frequency")+
theme_economist()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
facet_wrap(~Var3) ##only when the table was built from three variables
Categorical; Categorical vs. Categorical Tiles Plot

library(readxl)
Pedestrian <- read_excel("E:/1 Faeeza BUET/1. Part Time in BUET/Pedestrian.xlsx")
View(Pedestrian)
library(ggplot2)
table1<-table(Pedestrian$Age, Pedestrian$Intersection)
dataframe1<- as.data.frame(table1) #for frequency
dataframe2<- as.data.frame(prop.table(table1)) #for proportion
dataframe1
ggplot(dataframe1, aes(x=Var1, y=Var2, fill=Freq))+geom_tile(color="black")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 5,
limit = c(0,85), space = "Lab", name="Frequency")+
theme(text = element_text(face="bold.italic", color = "navy", family="Comic Sans MS", size=12))

Interpretation: This is a heatmap that shows the frequency distribution of two variables, labeled "Var1" and "Var2".
Var1 (horizontal axis) represents age groups: "Children," "Middle Aged," "Old," and "Young."
Var2 (vertical axis) represents locations: "Symoli," "Shapla Chattar," "Karwan Bazar," "Jasimuddin," "Gulshan 2," and "Bangla Motor."
Color Intensity (Heatmap values): The color gradient from white to red indicates frequency, with red representing higher
frequencies and lighter colors indicating lower frequencies.
Observations: The "Young" group at "Gulshan 2" has the highest frequency, as indicated by the darkest red cell.
The "Middle Aged" group shows relatively high frequencies at all intersections other than Karwan Bazar and Gulshan 2,
but not as high as the "Young" group at most of the intersections. The "Children" and "Old" groups show
relatively low frequencies across most locations.
**Go through the row or column first, then interpret the value of a cell where the two axes (variables) cross.
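For readers without the Pedestrian file, the same tiles plot can be sketched on built-in data. The example assumes mtcars, with cylinders and gears as the two categorical variables, and uses a simple white-to-red gradient as an illustrative choice.

```r
library(ggplot2)

# Two-way frequency table converted to a data frame: Var1, Var2, Freq
df_tile <- as.data.frame(table(mtcars$cyl, mtcars$gear))

p_tile <- ggplot(df_tile, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile(color = "black") +                    # one tile per combination
  geom_text(aes(label = Freq)) +                  # print the count on each tile
  scale_fill_gradient(low = "white", high = "red", name = "Frequency") +
  labs(x = "Cylinders", y = "Gears")

print(p_tile)
```

Printing the counts with geom_text() makes it easier to check the reading of the colors: the darkest tile should carry the largest number.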
Categorical; Categorical vs. Categorical Tiles Plot

library(readxl)
Pedestrian <- read_excel("E:/1 Ankhi BUET/1. Part Time in BUET/Plan 296/R/Dataset/Pedestrian.xlsx")
View(Pedestrian)
T1<-as.data.frame(table(Pedestrian$Intersection, Pedestrian$Gender, Pedestrian$Age))
T1
ggplot(T1, aes(x=Var1, y=Var3, fill=Freq))+
geom_tile(color="black")+
scale_fill_gradient2(space = "Lab")+
geom_text(aes(label = round(Freq, 1)))+
facet_wrap(~Var2)

space = "Lab": the color space in which the gradient is calculated; it makes the transitions between colors
appear smoother and more natural to the human eye
-- Other options such as RGB and HCL are not supported in this version
geom_text() = shows the values as text on the tiles
Categorical; Categorical vs. Categorical Tiles Plot

Example:
Day vs. working hour

https://rud.is/b/2016/02/14/making-faceted-heatmaps-with-ggplot2/
Continuous; Categorical vs. Quantitative Histogram 97

Example 1: weight of 50 students Example 2: weight of 20 students

https://www.youtube.com/watch?v=Txlm4ORI4Gs
Continuous; Categorical vs. Quantitative Histogram 98

Example 1: weight of 20 students

Interpretation of density plot:


https://www.youtube.com/watch?v=Txlm4ORI4Gs
https://www.khanacademy.org/math/ap-statistics/densitycurves-normal-distribution-ap/density-curves/v/density-curves
Continuous; Categorical vs. Quantitative Density Curve 99

Infinite number of intervals
Must lie on or above the horizontal axis
Total area at exactly 150 lb weight = 0
So the curve represents the area over an interval, not the value at a single point
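The idea that a single value has zero area can be checked numerically. The sketch below is only an illustration: it assumes, purely for the example, that weights follow a normal curve with mean 150 lb and standard deviation 20 lb. The area over an interval around 150 lb shrinks toward 0 as the interval gets narrower.

```r
# Assumed density curve (illustrative only): Normal(mean = 150, sd = 20)
f <- function(x) dnorm(x, mean = 150, sd = 20)

# 0 to 300 lb covers essentially the whole curve, so the area is about 1
total_area <- integrate(f, 0, 300)$value
total_area

# Area of intervals centred on 150 lb, from wide to very narrow
widths <- c(20, 2, 0.2, 0.02)
areas  <- sapply(widths, function(w) integrate(f, 150 - w / 2, 150 + w / 2)$value)
data.frame(width = widths, area = areas)
```

The narrower the interval, the smaller the proportion of observations it holds, which is why only intervals (not single values) have a meaningful area.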
Continuous; Categorical vs. Quantitative Histogram: Frequency 102

General Structure:
ggplot(data, aes(x=Var))+ ##or also add y=..count..
geom_histogram(color="white", fill= "cornflowerblue", bins=30)+ ##or use binwidth=.3 instead of bins=30
scale_x_continuous(breaks = seq(0, 3, .5), limits=c(0, 3), label = scales::comma)+
scale_y_continuous(breaks = seq(0, 70, 10), limits=c(0, 70), label = scales::comma)+
labs(title="Car Characteristics", subtitle="Proportion of Gear", x="Number of Gear",
y="Percentage of cars",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
facet_wrap(~Var1, ncol=1)

⮕bins means the number of classes you want
⮕binwidth means the class width
⮕If you do not enter bins or binwidth, the system will assign a value (30 bins by default)
⮕To make the graph more than 1D, use the fill/color options within aes() and provide
position="identity" in geom_histogram()
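The difference between bins and binwidth can be seen on a small made-up dataset. This is a sketch, not part of the original slides: the data frame `toy` is hypothetical, and layer_data() is used only to inspect the bins that ggplot2 actually computed.

```r
library(ggplot2)

# Hypothetical speeds for two groups (values are made up)
set.seed(42)
toy <- data.frame(
  Speed  = c(rnorm(100, 1.2, 0.2), rnorm(100, 1.5, 0.25)),
  Gender = rep(c("Female", "Male"), each = 100)
)

# bins fixes the number of classes; binwidth fixes the class width
p_bins  <- ggplot(toy, aes(x = Speed)) + geom_histogram(bins = 10, color = "white")
p_width <- ggplot(toy, aes(x = Speed)) + geom_histogram(binwidth = 0.1, color = "white")

# layer_data() returns one row per computed bin
n_bins    <- nrow(layer_data(p_bins))
d_width   <- layer_data(p_width)
bin_width <- unique(round(d_width$xmax - d_width$xmin, 10))
n_bins
bin_width

# More than 1D: fill goes inside aes(), and position = "identity"
# overlays the two groups instead of stacking them
p_group <- ggplot(toy, aes(x = Speed, fill = Gender)) +
  geom_histogram(binwidth = 0.1, alpha = 0.5, position = "identity")
p_group
```

Use alpha below 1 with position = "identity", otherwise the group drawn last hides the one behind it.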
Continuous; Categorical vs. Quantitative Histogram: Proportion or Relative Frequency 103

General Structure:
library(scales)
ggplot(Pedestrian ,aes(x=Speed, y=..count.. / sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+
scale_y_continuous(breaks = seq(0,.3,.1), limits=c(0, .3), label = scales::comma)+
theme_bw()+
facet_grid(Age~Gender)

**Need to load the scales package and add the y component in the aes() function


Continuous; Categorical vs. Quantitative Histogram: Percentage 104

General Structure:
library(scales)
ggplot(Pedestrian ,aes(x=Speed, y=..count.. / sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+
scale_y_continuous(breaks = seq(0,.3,.1), limits=c(0, .3), label = scales::percent)+
theme_bw()+
facet_grid(Age~Gender)

**Need to load the scales package and add the y component in the aes() function
Continuous; Categorical vs. Quantitative Histogram: Density 105

✓ A density curve is a curve that is always on or above the horizontal axis and has an area of exactly 1 underneath it.
✓ A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the proportion of all observations that fall in that range.

y-axis density = relative frequency / width of the bins
General Structure:
library(ggplot2)
library(scales)
ggplot(Pedestrian, aes(x=Speed, y=..density..))+
geom_histogram(color="white", fill= "cornflowerblue",
bins=5)
Need to add the y component in the aes() function; the scales package is only needed for comma or percent axis labels
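The density formula above can be verified by hand with base R. This is a minimal sketch on made-up speeds (the vector `speed` is hypothetical): hist() with plot = FALSE returns the counts and the density of each bin, so the two can be compared directly.

```r
# Hypothetical crossing speeds in m/s (values are made up)
set.seed(7)
speed <- rnorm(200, mean = 1.3, sd = 0.3)

# Compute the bins without drawing the histogram
h <- hist(speed, breaks = 8, plot = FALSE)

bin_width       <- diff(h$breaks)              # width of each bin
rel_frequency   <- h$counts / sum(h$counts)    # proportion in each bin
density_by_hand <- rel_frequency / bin_width   # formula from the slide

# Same values as the density that hist() reports
all.equal(density_by_hand, h$density)

# Bar height x bar width, summed over all bins, gives a total area of 1
sum(h$density * bin_width)
```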
Continuous; Categorical vs. Quantitative Histogram: Density 106
Relative frequency on the y-axis:
The bin representing 0.7 to 1.5 reaches a height of approximately 0.6 on the y-axis. This means that about 60% of all observations in the dataset have speeds within this range. This high proportion emphasizes that most of the dataset is concentrated around this speed range.

Density on the y-axis:
…relative to other speed values, 0.7 to 1.5 has the greatest probability density. A high density here means that if we randomly select an observation, it's most likely to fall within this range.
Continuous; Categorical vs. Quantitative Histogram: 4 types of axis input 107

Express occurrences

1. If you want to see frequency/count
Code:
aes(x=Var1)
Or, aes(x=Var1, y=..count..)
[must keep the y axis as scale_y_continuous(label = scales::comma)]
Practice:
ggplot(Pedestrian, aes(x=Speed))+geom_histogram()
Or,
ggplot(Pedestrian, aes(x=Speed, y=..count..))+geom_histogram()

2. If you want to see proportions or relative frequency
Code:
aes(x=Var1, y=..count../ sum(..count..))
[may keep the y axis as scale_y_continuous(label = scales::comma)]
Practice:
ggplot(Pedestrian, aes(x=Speed, y=..count../sum(..count..)))+geom_histogram()

3. If you want to see percentage
Code:
aes(x=Var1, y=..count../ sum(..count..))
[must keep the y axis as scale_y_continuous(label = scales::percent)]
Practice:
ggplot(Pedestrian, aes(x=Speed, y=..count../sum(..count..)))+geom_histogram(bins=5)+
scale_y_continuous(label = scales::percent)

Express probability

4. If you want to see density
Code:
aes(x=Var1, y=..density..)
[may keep the y axis as scale_y_continuous(label = scales::comma)]
Practice:
ggplot(Pedestrian, aes(x=Speed, y=..density..))+geom_histogram()
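The four y-axis inputs can be compared side by side on one made-up dataset. This is a sketch under stated assumptions: the data frame `toy` is hypothetical, and the code uses after_stat(count), which is the current ggplot2 spelling of ..count.. (available from ggplot2 3.3.0). layer_data() is used only to read back the bar heights.

```r
library(ggplot2)

# Hypothetical speeds (values are made up)
set.seed(3)
toy  <- data.frame(Speed = runif(150, min = 0.5, max = 2.5))
base <- ggplot(toy, aes(x = Speed))

p_count   <- base + geom_histogram(aes(y = after_stat(count)), bins = 5)
p_prop    <- base + geom_histogram(aes(y = after_stat(count / sum(count))), bins = 5)
p_percent <- p_prop + scale_y_continuous(labels = scales::percent)
p_density <- base + geom_histogram(aes(y = after_stat(density)), bins = 5)

# Bar heights computed by each version
d_count   <- layer_data(p_count)
d_prop    <- layer_data(p_prop)
d_density <- layer_data(p_density)

total_count  <- sum(d_count$y)                                     # number of observations
total_prop   <- sum(d_prop$y)                                      # proportions add up to 1
density_area <- sum(d_density$y * (d_density$xmax - d_density$xmin))  # area adds up to 1
c(total_count, total_prop, density_area)
```

The percentage version has the same bar heights as the proportion version; only the axis labels change.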
Continuous; Categorical vs. Quantitative Histogram: Add Mean Line 108

General Structure:
1D histogram:
geom_vline(aes(xintercept=mean(Speed)), color="blue", linetype="dashed", size=1)
Practice:
ggplot(Pedestrian, aes(x=Speed))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+
theme_bw()+geom_vline(aes(xintercept=mean(Speed)), color="blue", linetype="dashed", size=1)

For a histogram with more than one dimension, you first need to create a data frame that summarizes the mean value
for each category, as follows:
library(plyr)
V_Mean <- ddply(data, "Gender", summarise, grp.mean=mean(Speed))
Then add the structure:
geom_vline(data=V_Mean, aes(xintercept=grp.mean, color=Gender), linetype="dashed",size=2)
Practice:
install.packages("plyr")
library(plyr)
V_mean<-ddply(Pedestrian, "Gender", summarise, grp.mean=mean(Speed))
V_mean
ggplot(Pedestrian , aes(x=Speed, y=..count../sum(..count..)))+
geom_histogram(color="cyan4",fill="cornflowerblue", bins = 5)+ theme_bw()+ facet_wrap(~Gender)+
geom_vline(data=V_mean, aes(xintercept=grp.mean, color=Gender), linetype="dashed",size=2)
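The same group-mean lines can be sketched without the Excel file. The example below is an illustration only: the data frame `toy` is made up, and base R's aggregate() is swapped in for ddply() so that the plyr package is not required. Because the summary data frame keeps the Gender column, each facet receives only its own mean line.

```r
library(ggplot2)

# Hypothetical speeds for two groups (values are made up)
set.seed(11)
toy <- data.frame(
  Speed  = c(rnorm(80, 1.2, 0.2), rnorm(80, 1.5, 0.2)),
  Gender = rep(c("Female", "Male"), each = 80)
)

# Base R alternative to ddply(): one row per Gender with its mean Speed
V_mean <- aggregate(Speed ~ Gender, data = toy, FUN = mean)
names(V_mean)[2] <- "grp.mean"
V_mean

# The Gender column in V_mean matches the facet variable,
# so each panel shows only the mean of its own group
p_mean <- ggplot(toy, aes(x = Speed)) +
  geom_histogram(bins = 10, color = "white", fill = "cornflowerblue") +
  geom_vline(data = V_mean, aes(xintercept = grp.mean, color = Gender),
             linetype = "dashed") +
  facet_wrap(~Gender) +
  theme_bw()
p_mean
```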
Continuous; Categorical vs. Quantitative Kernel Density Plot 110

General Structure:
ggplot(data, aes(x=Var1, fill= Var2, color= Var3))+
geom_density(fill="cornflowerblue", color="blue", alpha=1, bw=.1)+ ##fixed fill/color here override the aes() mappings; remove them for a grouped plot
scale_x_continuous(breaks = seq(0, 3, .5), limits=c(0, 3), label = scales::comma)+
scale_y_continuous(breaks = seq(0, .2, .05), limits=c(0, .2), label = scales::comma)+
labs(title="Car Characteristics",
subtitle="Proportion of Gear",
x="Number of Gear",
y="Density",
caption="Source: Car sale data of Toyota")+
theme_gray()+
theme(axis.text.x = element_text(face="bold", angle = 45, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 45, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", family="Comic Sans MS", size=24))+
facet_wrap(~Var2, ncol=1)
Continuous; Categorical vs. Quantitative Kernel Density Plot 111
General Structure:
library(ggplot2)
library(scales)
ggplot(Pedestrian, aes(x=Speed, y=..density..))+
geom_histogram(color="white", fill="cornflowerblue", bins=5)+
geom_density()
A histogram with density on the y axis and a density curve both represent the same output

…relative to other speed values, 0.5 to 1.5 has the greatest probability density. A high density here means that if we randomly select an observation, it's most likely to fall within this range.

Is the height limited to 1 unit??

No. The density can go above 1 in a density plot because density does not represent a probability directly (although a higher density does mean a higher probability of falling near that value). Instead, it shows the relative likelihood of different values, scaled so that the area under the entire density curve equals 1, not the height of the curve itself. When the data are concentrated in a narrow range (narrower than 1 unit on the x axis), the height (density) has to exceed 1 in order to keep the total area at 1. The height of the density at any given point is not a probability but a relative density, and a density value above 1 is still valid as long as the total area remains normalized to 1
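A short numeric check makes this concrete. The sketch below is an illustration with made-up numbers: a uniform distribution on 0 to 0.5, and a hypothetical `speed` vector that is tightly clustered around 1.3 m/s. In both cases the height goes above 1 while the area stays at 1.

```r
# Uniform distribution on [0, 0.5]: width 0.5, so the height must be 2
height <- dunif(0.25, min = 0, max = 0.5)
area   <- integrate(dunif, 0, 0.5, min = 0, max = 0.5)$value
c(height = height, area = area)

# Hypothetical speeds concentrated in a narrow range (values are made up)
set.seed(5)
speed <- rnorm(500, mean = 1.3, sd = 0.15)
kd <- density(speed)                 # kernel density estimate

peak_height <- max(kd$y)             # above 1
kde_area    <- sum(kd$y) * diff(kd$x[1:2])   # approximate area under the curve
c(peak_height = peak_height, kde_area = kde_area)
```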
Continuous; Categorical vs. Quantitative Kernel Density Plot - Add Mean Line 113

library(ggplot2)
ggplot(Pedestrian,aes(x=Speed))+geom_density(fill="cornflowerblue",
color="blue", alpha=.1)+ theme_bw()+
geom_vline(aes(xintercept=mean(Speed)), color="blue",
linetype="dashed", size=1)

library(plyr)
V_mean<-ddply(Pedestrian, "Gender", summarise, grp.mean=mean(Speed))
V_mean
ggplot(Pedestrian, aes(x=Speed, fill = Gender))+geom_density(alpha=.1)+ ##alpha is a fixed value, so it goes outside aes()
geom_vline(data=V_mean, aes(xintercept=grp.mean, color=Gender),
linetype="dashed",size=2)
Continuous; Categorical vs. Quantitative Kernel Density Plot - Add Mean Line 114

How??

Practice:
library(plyr)
V_mean<-ddply(Pedestrian, "Age",
summarise, grp.mean=mean(Speed))
V_mean

ggplot(Pedestrian, aes(x=Speed,
fill=Age))+geom_density(alpha=.1)+
geom_vline(data=V_mean,
aes(xintercept=grp.mean, color=Age),
linetype="dashed",size=2)+
facet_wrap(~Age)+
theme_bw()
Continuous; Categorical vs. Quantitative Histogram and Density Plot Together 115
General Structure:
library(plyr)
V_mean <- ddply(Pedestrian, "Age", summarise, grp.mean = mean(Speed))
V_mean
ggplot(Pedestrian, aes(x = Speed, fill = Age)) +
geom_histogram(aes(y = ..density..), alpha = .2, color = "blue") +
geom_density(alpha = .1) +
geom_vline(data = V_mean, aes(xintercept = grp.mean, color = Age)) +
theme_bw() + facet_wrap(~Age)

Visually not pleasant, misleading to some extent
Continuous; Categorical vs. Quantitative Histogram and Density Plot Together 116

Practice:
library(plyr)
data<-Pedestrian
V_Mean <- ddply(data, "Gender", summarise, mean=mean(Speed))
V_Mean

ggplot(data, aes(x=Speed, color=Gender, fill=Gender))+
geom_histogram(aes(y=..density..), alpha=.3, bins=30, size=1.3, position = "identity")+
geom_vline(data=V_Mean, aes(xintercept=mean, color=Gender), linetype="dashed", size=1.2)+
geom_density(alpha=0.2, size=2, fill=NA)+
scale_x_continuous(breaks = seq(0, 3, .5), limits=c(0, 3), label = scales::comma)+
scale_y_continuous(breaks = seq(0, 2, .5), limits=c(0, 2), label = scales::comma)+
labs(title="Histogram plot", x="Speed(m/s)", y = "Density", fill= "Var2")+
theme_minimal()+
theme(axis.text.x = element_text(face="bold", angle = 0, color= "Red", hjust = .51))+
theme(axis.text.y = element_text(face="bold.italic", angle = 0, color= "Red", hjust = .51))+
theme(text = element_text(color = "navy", face="bold", family="Comic Sans MS", size=24))
Continuous; Categorical vs. Quantitative Interactive Graph 117

✓ You can prepare an interactive graph by using the "plotly" package.
First, you need to load the package. ⮕ Then, prepare your desired graph and assign it to a name (for example, plot). ⮕ Finally, use the ggplotly() function. For example: ggplotly(plot)

library(plyr)
data<-Pedestrian
V_Mean <- ddply(data, "Gender", summarise,
mean=mean(Speed))
V_Mean

install.packages("plotly")
library(plotly)
plot<-ggplot(data, aes(x=Speed, fill=Gender))+
geom_histogram(aes(y=..density..),
alpha=.7,color="black", bins=40, size=.5)+
geom_density(alpha=.4)+
geom_vline(data=V_Mean,
aes(xintercept=mean, color=Gender))+
facet_wrap(~Gender)
ggplotly(plot)
Explore – Not for exam 118

1. Bar Plot for Continuous Variable:

First, you need to create a data frame that summarizes the mean value for each category, as follows:

library(plyr)
V_Mean <- ddply(data, "Gender", summarise, grp.mean=mean(Speed))

Then develop the chart in the same way as a 2D Bar Plot.
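The two steps above can be sketched as follows. This is an illustration on made-up data, not the original slide's chart: the data frame `toy` is hypothetical, and base R's aggregate() stands in for ddply(). The bar heights come straight from the summary table, so stat = "identity" is needed (geom_col() is the equivalent shortcut).

```r
library(ggplot2)

# Hypothetical speeds for two groups (values are made up)
set.seed(21)
toy <- data.frame(
  Speed  = c(rnorm(50, 1.2, 0.2), rnorm(50, 1.5, 0.2)),
  Gender = rep(c("Female", "Male"), each = 50)
)

# Step 1: summary data frame with one mean per category
V_Mean <- aggregate(Speed ~ Gender, data = toy, FUN = mean)
names(V_Mean)[2] <- "grp.mean"
V_Mean

# Step 2: bar plot where the bar height is the mean itself,
# not a count, hence stat = "identity"
p_bar <- ggplot(V_Mean, aes(x = Gender, y = grp.mean)) +
  geom_bar(stat = "identity", fill = "cornflowerblue") +
  labs(x = "Gender", y = "Mean speed (m/s)") +
  theme_bw()
p_bar
```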
Explore – Not for exam 119

2. Cleveland Dot Charts:

First, you need to create a data frame that summarizes the mean value for each category, as follows:
library(plyr)
V_Mean <- ddply(data, "Gender", summarise, grp.mean=mean(Speed))
Then develop the scatter plot, keeping the continuous variable on the x-axis.

General Standard Structure


library(plyr)
Speed_mean<- ddply(mydata, .(Intersection), summarize, Mean=mean(Speed))
Speed_mean
ggplot(Speed_mean, aes(x=Mean, y=reorder(Intersection, Mean))) + geom_point(color="blue", size = 4) +
geom_segment(aes(x = 1, xend = Mean, y = reorder(Intersection, Mean), yend = reorder(Intersection, Mean)),
color = "cornflowerblue", size=1) +
labs (x = "Crossing speed (m/s)", y = "", title = "Crossing Speed by Intersection")+
theme_bw()+
theme(axis.text.x = element_text(face="bold", angle = 0))+
theme(axis.text.y = element_text(face="bold", angle = 0))+
theme(text = element_text(size=14))
Explore – Not for exam 120

3. Mosaic Plot:
4. Sankey diagram:
5. Chord diagram:
Explore – Not for exam 121

6. Mean/SEM plots:
7. Fancy jittered plot:
8. Dot Chart:
9. Bubble plot:
10. Dumbbell Plot:
Explore – Not for exam 122

11. Choropleth maps:

Check for better understanding:


https://rkabacoff.github.io/datavis/
https://bookdown.org/content/b298e479-b1ab-49fa-b83d-a57c2b034d49/distributions.html
https://r-charts.com/flow/sankey-diagram-ggplot2/
Many other websites….
