
Ggplot2 Library Boxplot

The document provides information on creating boxplots in R using ggplot2. It demonstrates how to: 1) Create a basic boxplot with ggplot2 using geom_boxplot(), specifying quantitative and qualitative variables for the y and x axes. 2) Add individual data points to a boxplot using geom_jitter() to avoid overlaps and better visualize distributions. 3) Customize boxplot appearance by adjusting parameters like color, fill, outliers, and notches within geom_boxplot().

Uploaded by

Luis Emilio

The ggplot2 library lets you build a boxplot with geom_boxplot(). You have to specify a quantitative variable for the Y axis and a qualitative variable (a group) for the X axis.

# Load ggplot2
library(ggplot2)

# The mtcars dataset is natively available


# head(mtcars)

# A really basic boxplot.


ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) +
geom_boxplot(fill="slateblue", alpha=0.2) +
xlab("cyl")

Boxplot with individual data points


If you are not convinced about the danger of using a basic boxplot, please read this post that explains it in depth.
Fortunately, ggplot2 makes it a breeze to add individual observations on top of the boxes thanks to the geom_jitter() function. This function shifts each dot by a small random amount (controlled by its width and height arguments), which avoids overlaps.
Now, do you see the bimodal distribution hidden behind group B?

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)

# create a dataset
data <- data.frame(
name=c( rep("A",500), rep("B",500), rep("B",500), rep("C",20), rep('D', 100) ),
value=c( rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1) )
)

# Plot
data %>%
ggplot( aes(x=name, y=value, fill=name)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0.4, alpha=0.9) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
)+
ggtitle("A boxplot with jitter") +
xlab("")
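
As a rough numeric check of what the jittered points reveal, the sketch below rebuilds a group like B (two normal samples centered on 13 and 18, as in the dataset above) and compares how many values fall near one mode versus in the gap between the modes. The seed and the 0.5 window are arbitrary choices made for this illustration.

```r
# Rebuild a bimodal group similar to "B" (seed chosen arbitrarily for reproducibility)
set.seed(42)
B <- c(rnorm(500, 13, 1), rnorm(500, 18, 1))

# The five numbers a boxplot is built from: nothing here hints at two modes
five_numbers <- fivenum(B)
print(five_numbers)

# Share of values close to the first mode (13) versus in the gap between modes (15.5)
near_mode <- mean(abs(B - 13) < 0.5)
in_gap    <- mean(abs(B - 15.5) < 0.5)
print(c(near_mode = near_mode, in_gap = in_gap))
```

The median sits in the gap where very few observations actually lie, which is exactly what the box alone cannot show.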

In case you are not convinced, here is what the basic boxplot and the basic violin plot look like:
# Boxplot basic
data %>%
ggplot( aes(x=name, y=value, fill=name)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
)+
ggtitle("Basic boxplot") +
xlab("")

# Violin basic
data %>%
ggplot( aes(x=name, y=value, fill=name)) +
geom_violin() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
)+
ggtitle("Violin chart") +
xlab("")
Ggplot2 boxplot parameters
This chart extends the most basic boxplot described in graph #262.
It describes the options you can pass to the geom_boxplot() function to customize the general chart appearance.
Note on notches: they are useful to compare groups. If the notches of 2 boxes do not overlap, this suggests that their medians are significantly different.

# Load ggplot2
library(ggplot2)

# The mpg dataset ships with ggplot2


#head(mpg)
# geom_boxplot proposes several arguments to custom appearance
ggplot(mpg, aes(x=class, y=hwy)) +
geom_boxplot(

# custom boxes
color="blue",
fill="blue",
alpha=0.2,

# Notch?
notch=TRUE,
notchwidth = 0.8,

# custom outliers
outlier.colour="red",
outlier.fill="red",
outlier.size=3
)
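
To make the notch rule concrete, here is a small base R sketch. boxplot.stats() returns the notch limits in its conf element, computed as median ± 1.58 × IQR / sqrt(n) with the IQR taken between the hinges. The ggplot2 documentation describes its notches with the same formula, although its quartiles may be computed slightly differently. The two samples are made-up values used only for illustration.

```r
# Two hypothetical samples (illustrative values only)
g1 <- c(20, 22, 23, 24, 25, 26, 27, 29, 30)
g2 <- c(31, 33, 34, 35, 36, 37, 38, 40, 41)

# boxplot.stats()$conf holds the lower and upper ends of the notch
notch1 <- boxplot.stats(g1)$conf
notch2 <- boxplot.stats(g2)$conf

# Same interval computed by hand for g1: median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n)
h <- fivenum(g1)
manual1 <- h[3] + c(-1.58, 1.58) * (h[4] - h[2]) / sqrt(length(g1))

# Two intervals overlap when each one starts before the other one ends
notches_overlap <- notch1[2] >= notch2[1] && notch2[2] >= notch1[1]
print(notch1)
print(notch2)
print(notches_overlap)
```

When notches_overlap is FALSE, the notched boxes drawn with notch=TRUE would not overlap either, which is the visual cue described above.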

Control ggplot2 boxplot colors

General color customization

These four examples illustrate the most common color scales used in boxplots.
Note the use of RColorBrewer and viridis to automatically generate nice color palettes.
# library
library(ggplot2)

# The mpg dataset ships with ggplot2


#head(mpg)

# Top Left: Set a unique color with fill, colour, and alpha
ggplot(mpg, aes(x=class, y=hwy)) +
geom_boxplot(color="red", fill="orange", alpha=0.2)

# Top Right: Set a different color for each group


ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none")

# Bottom Left
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none") +
scale_fill_brewer(palette="BuPu")

# Bottom Right
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none") +
scale_fill_brewer(palette="Dark2")
Highlighting a group

Highlighting the main message conveyed by your chart is an important step in dataviz. If your story focuses on a specific group, you should highlight it in your boxplot.
To do so, first create a new column with mutate() where you store the binary information: highlight or not. Then provide this column to the fill aesthetic of ggplot2 and customize the appearance of the highlighted group with scale_fill_manual() and scale_alpha_manual().

# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)

# Work with the natively available mpg dataset


mpg %>%

# Add a column called 'type': do we want to highlight the group or not?


mutate( type=ifelse(class=="subcompact","Highlighted","Normal")) %>%

# Build the boxplot. In the 'fill' argument, give this column


ggplot( aes(x=class, y=hwy, fill=type, alpha=type)) +
geom_boxplot() +
scale_fill_manual(values=c("#69b3a2", "grey")) +
scale_alpha_manual(values=c(1,0.1)) +
theme_ipsum() +
theme(legend.position = "none") +
xlab("")

Add color to specific groups of a boxplot


A boxplot summarizes the distribution of a numeric variable for one or several groups.
It can be useful to add colors to specific groups to highlight them. For example, positive and negative controls are likely to be in different colors.
The easiest way is to give a vector of colors (myColors here) when you call the boxplot() function.
Use ifelse() statements to assign the color you want to a specific name.

#Create data
names <- c(rep("Maestro", 20) , rep("Presto", 20) ,
rep("Nerak", 20), rep("Eskimo", 20), rep("Nairobi", 20), rep("Artiko", 20))
value <- c( sample(3:10, 20 , replace=T) , sample(2:5, 20 , replace=T) ,
sample(6:10, 20 , replace=T), sample(6:10, 20 , replace=T) ,
sample(1:7, 20 , replace=T), sample(3:10, 20 , replace=T) )
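# names must be a factor so that levels() works (data.frame() no longer converts strings since R 4.0)
data <- data.frame(names=factor(names), value)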

# Prepare a vector of colors with specific color for Nairobi and Eskimo
myColors <- ifelse(levels(data$names)=="Nairobi" , rgb(0.1,0.1,0.7,0.5) ,
ifelse(levels(data$names)=="Eskimo", rgb(0.8,0.1,0.3,0.6),
"grey90" ) )

# Build the plot


boxplot(data$value ~ data$names ,
col=myColors ,
ylab="disease" , xlab="- variety -")

# Add a legend
legend("bottomleft", legend = c("Positive control","Negative control") ,
col = c(rgb(0.1,0.1,0.7,0.5) , rgb(0.8,0.1,0.3,0.6)) , bty = "n", pch=20 , pt.cex = 3, cex = 1, horiz = FALSE,
inset = c(0.03, 0.1))
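
As an alternative to nested ifelse() calls, the colors can be kept in a named vector: every group starts with a default color and only the highlighted ones are overwritten. This is a sketch, not the method used above; the name color_lookup is made up for the illustration.

```r
# Group names in the order boxplot() draws them (alphabetical factor levels)
groups <- sort(c("Maestro", "Presto", "Nerak", "Eskimo", "Nairobi", "Artiko"))

# Start with a default color for every group, then overwrite the highlighted ones
color_lookup <- setNames(rep("grey90", length(groups)), groups)
color_lookup["Nairobi"] <- rgb(0.1, 0.1, 0.7, 0.5)
color_lookup["Eskimo"]  <- rgb(0.8, 0.1, 0.3, 0.6)
print(color_lookup)
```

The vector can then be passed to the col argument of boxplot(), as long as its order matches the order of the boxes.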

Basic R: X axis labels on several lines


It can be handy to display X axis labels on several lines, for instance to add the number of values present in each box of a boxplot.
How it works:
• Change the names of your categories using the names() function.
• Use \n to start a new line.
• Increase the distance between the labels and the X axis with the mgp argument of the par() function. This avoids overlap with the axis.
Note: mgp is a numeric vector of length 3 that sets the axis label locations relative to the edge of the inner plot window. The default value is c(3,1,0). The first value is the location of the axis titles (xlab and ylab in plot), the second is the location of the tick-mark labels (what we want to lower), and the third is the position of the tick marks.

# Create 2 vectors
a <- sample(2:24, 20 , replace=T)
b <- sample(4:14, 8 , replace=T)

# Make a list of these 2 vectors


C <- list(a,b)

# Change the names of the elements of the list :


names(C) <- c(paste("Category 1\n n=" , length(a) , sep=""), paste("Category 2\n n=" , length(b) , sep=""))

# Change the mgp argument: avoid text overlaps axis


par(mgp=c(3,2,0))

# Final Boxplot
boxplot(C , col="#69b3a2" , ylab="value" )

Boxplot with jitter in base R


A boxplot can be dangerous: the exact distribution of each group is hidden behind the boxes, as explained in data-to-viz.
If the number of observations is not too high, you can add individual observations on top of the boxes, using jittering to avoid dot overlap.
In base R, this is done manually with a loop that adds the dots group by group, computing a random X position for each of them.

# Create data
names <- c(rep("A", 80) , rep("B", 50) , rep("C", 70))
value <- c( rnorm(80 , mean=10 , sd=9) , rnorm(50 , mean=2 , sd=15) , rnorm(70 , mean=30 , sd=10) )
# names must be a factor: levels() and summary() below rely on it
data <- data.frame(names=factor(names), value)

# Basic boxplot
boxplot(data$value ~ data$names , col=terrain.colors(4) )

# Add data points


mylevels <- levels(data$names)
levelProportions <- summary(data$names)/nrow(data)
for(i in 1:length(mylevels)){

thislevel <- mylevels[i]


thisvalues <- data[data$names==thislevel, "value"]
# take the x-axis indices and add a jitter, proportional to the N in each level
myjitter <- jitter(rep(i, length(thisvalues)), amount=levelProportions[i]/2)
points(myjitter, thisvalues, pch=20, col=rgb(0,0,0,.9))

}
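
To see what jitter() does to the X positions, here is a minimal sketch for a single box. The box position (2), the number of points and the seed are arbitrary values chosen for the illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Ten points that all belong to the box drawn at x = 2
x_positions <- jitter(rep(2, 10), amount = 0.2)
print(round(x_positions, 3))

# With 'amount' set, each point is moved by a uniform random offset of at most 0.2
print(range(x_positions))
```

A larger amount spreads the dots further from the center of the box, which is why the loop above scales it with the share of observations in each level.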
Reordering category by median

The most common need is to reorder categories by increasing median. It makes it easy to spot which group has the highest value and how categories are ranked.
This can be done using the reorder() function in combination with the with() function, as suggested below:

# Create data : 7 varieties / 20 samples per variety / a numeric value for each sample
variety <- rep( c("soldur", "silur", "lloyd", "pescadou", "X4582", "Dudur", "Classic"), each=20)
note <- c( sample(2:5, 20 , replace=T) , sample(6:10, 20 , replace=T),
sample(1:7, 30 , replace=T), sample(3:10, 70 , replace=T) )
data <- data.frame(variety, note)

# Create a vector named "new_order" containing the desired order


new_order <- with(data, reorder(variety , note, median , na.rm=T))
# Draw the boxplot using this new order
boxplot(data$note ~ new_order , ylab="sickness" , col="#69b3a2", boxwex=0.4 , main="")
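
The effect of reorder() is easier to see on a tiny dataset with known medians. The data frame below is made up for the illustration: group "a" has median 6, "b" has median 2 and "c" has median 9.

```r
# Toy data: three groups with medians 6 (a), 2 (b) and 9 (c)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 3),
  score = c(5, 6, 7,  1, 2, 3,  8, 9, 10)
)

# reorder() returns a factor whose levels are sorted by the median score of each group
ordered_group <- with(toy, reorder(group, score, median))
print(levels(ordered_group))
```

The levels come out as "b", "a", "c", so boxplot(toy$score ~ ordered_group) would draw the boxes from the lowest to the highest median.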
Give a specific order

Boxplot categories are provided in a column of the input data frame. This column needs to be a factor, and has
several levels. Categories are displayed on the chart following the order of this factor, often in alphabetical order.
Sometimes, we need to show groups in a specific order (A,D,C,B here). This can be done by reordering the levels,
using the factor() function.

#Creating data
names <- c(rep("A", 20) , rep("B", 20) , rep("C", 20), rep("D", 20))
value <- c( sample(2:5, 20 , replace=T) , sample(6:10, 20 , replace=T),
sample(1:7, 20 , replace=T), sample(3:10, 20 , replace=T) )
data <- data.frame(names,value)

# Classic boxplot (A-B-C-D order)


# boxplot(data$value ~ data$names)

# I reorder the groups order : I change the order of the factor data$names
data$names <- factor(data$names , levels=c("A", "D", "C", "B"))

#The plot is now ordered !


boxplot(data$value ~ data$names , col=rgb(0.3,0.5,0.4,0.6) , ylab="value" ,
xlab="names in desired order")
Grouped and ordered boxplot

In a grouped boxplot, categories are organized in groups and subgroups. For instance, let’s take several varieties (group) that are grown in high or low temperature (subgroup).
Here both subgroups are represented side by side, and groups are ranked by increasing mean (the code below passes mean to reorder(); use median instead to rank by median):

# Create dummy data


variety <- rep( c("soldur", "silur", "lloyd", "pescadou", "X4582", "Dudur", "Classic"), each=40)
treatment <- rep(c(rep("high" , 20) , rep("low" , 20)) , 7)
note <- c( rep(c(sample(0:4, 20 , replace=T) , sample(1:6, 20 , replace=T)),2),
rep(c(sample(5:7, 20 , replace=T), sample(5:9, 20 , replace=T)),2),
c(sample(0:4, 20 , replace=T) , sample(2:5, 20 , replace=T),
rep(c(sample(6:8, 20 , replace=T) , sample(7:10, 20 , replace=T)),2) ))
data=data.frame(variety, treatment , note)

# Reorder varieties (group) (mixing low and high treatments for the calculations)
new_order <- with(data, reorder(variety , note, mean , na.rm=T))

# Then I make the boxplot, asking to use the 2 factors : variety (in the good order) AND treatment :
par(mar=c(3,4,3,1))
myplot <- boxplot(note ~ treatment*new_order , data=data ,
boxwex=0.4 , ylab="sickness",
main="sickness of several wheat lines" ,
col=c("slateblue1" , "tomato") ,
xaxt="n")

# To add the label of x axis


my_names <- sapply(strsplit(myplot$names , '\\.') , function(x) x[[2]] )
my_names <- my_names[seq(1 , length(my_names) , 2)]
axis(1,
at = seq(1.5 , 14 , 2),
labels = my_names ,
tick=FALSE , cex=0.3)

# Add the grey vertical lines


for(i in seq(0.5 , 20 , 2)){
abline(v=i,lty=1, col="grey")
}

# Add a legend
legend("bottomright", legend = c("High treatment", "Low treatment"),
col=c("slateblue1" , "tomato"),
pch = 15, bty = "n", pt.cex = 3, cex = 1.2, horiz = F, inset = c(0.1, 0.1))
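
The axis-label step relies on the box names that boxplot() builds for an interaction, of the form subgroup.group. The sketch below replays that string handling on a hypothetical vector of four box names, without drawing anything.

```r
# Hypothetical box names, in the form boxplot() produces for treatment*variety
box_names <- c("high.Dudur", "low.Dudur", "high.silur", "low.silur")

# Keep the part after the dot (the variety) for every box
variety_part <- sapply(strsplit(box_names, "\\."), function(x) x[[2]])

# One label per pair of boxes, placed halfway between the two boxes of the pair
group_labels <- variety_part[seq(1, length(variety_part), 2)]
label_at     <- seq(1.5, length(box_names), 2)
print(data.frame(label_at, group_labels))
```

With 7 varieties and 2 treatments there are 14 boxes, which is why the code above uses seq(1.5, 14, 2) for the label positions.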

Boxplot with variable width


When the sample size behind each category is highly variable, it can be useful to represent it through the box widths.
First calculate the proportion of each level using the table() function. Using these proportions makes a box twice as wide if its level is twice as frequent. Then give these proportions to the width argument when you call the boxplot() function.

# Dummy data
names <- c(rep("A", 20) , rep("B", 8) , rep("C", 30), rep("D", 80))
value <- c( sample(2:5, 20 , replace=T) , sample(4:10, 8 , replace=T),
sample(1:7, 30 , replace=T), sample(3:8, 80 , replace=T) )
data <- data.frame(names,value)

# Calculate proportion of each level


proportion <- table(data$names)/nrow(data)

# Draw the boxplot, with the width proportional to the occurrence


boxplot(data$value ~ data$names , width=proportion , col=c("orange" , "seagreen"))
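
As a quick check of what is passed to width, the sketch below recomputes the proportions for the same group sizes (20, 8, 30 and 80) and compares two of them. The variable names are made up for the illustration.

```r
# Same group sizes as in the dummy data
group_names <- c(rep("A", 20), rep("B", 8), rep("C", 30), rep("D", 80))

# Share of observations in each group
proportion <- table(group_names) / length(group_names)
print(round(proportion, 3))

# D has four times as many observations as A, so its box is four times as wide
ratio_D_A <- proportion[["D"]] / proportion[["A"]]
print(ratio_D_A)
```

boxplot() also has a varwidth=TRUE argument, which draws widths proportional to the square root of the group sizes and needs no manual computation.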

Add text over boxplot in base R


The first challenge here is to recover the position of the top part of each box. This is done by saving the boxplot() result in an object (called boundaries here). Now, typing boundaries$stats gives a matrix with one column per box and 5 rows: lower whisker, lower hinge, median, upper hinge, and upper whisker.
Then, it is possible to use the text() function to add labels on top of each box. This function takes 3 inputs:
• x axis positions of the labels. In our case, it will be 1,2,3,4 for 4 boxes.
• y axis positions, available in the boundaries$stats object.
• text of the labels: the number of values per group or whatever else.

# Dummy data
names <- c(rep("A", 20) , rep("B", 8) , rep("C", 30), rep("D", 80))
value <- c( sample(2:5, 20 , replace=T) , sample(4:10, 8 , replace=T),
sample(1:7, 30 , replace=T), sample(3:8, 80 , replace=T) )
# names must be a factor so that nlevels() below returns the number of groups
data <- data.frame(names=factor(names), value)

# Draw the boxplot. Note result is also stored in a object called boundaries
boundaries <- boxplot(data$value ~ data$names , col="#69b3a2" , ylim=c(1,11))
# Now you can type boundaries$stats to get the boundaries of the boxes

# Add sample size on top


nbGroup <- nlevels(data$names)
text(
x=c(1:nbGroup),
y=boundaries$stats[nrow(boundaries$stats),] + 0.5,
paste("n = ",table(data$names),sep="")
)
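
To inspect the object returned by boxplot() without drawing anything, the plot=FALSE argument can be used. The sketch below does so on a small made-up dataset with two groups, so the content of stats and n is easy to verify by hand.

```r
# Toy data: group A has 5 values, group B has 3
toy <- data.frame(
  group = rep(c("A", "B"), c(5, 3)),
  value = c(1, 2, 3, 4, 5,  10, 20, 30)
)

# plot=FALSE returns the statistics without opening a graphics device
boundaries <- boxplot(value ~ group, data = toy, plot = FALSE)

# One column per group; rows are lower whisker, lower hinge, median, upper hinge, upper whisker
print(boundaries$stats)

# Last row = top of the upper whisker, i.e. the y position used for the labels
top_of_boxes <- boundaries$stats[nrow(boundaries$stats), ]
print(top_of_boxes)

# Number of observations per group, an alternative to table() for the "n = " labels
print(boundaries$n)
```

Here top_of_boxes is 5 for A and 30 for B, and boundaries$n is 5 and 3.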

Tukey Test and boxplot in R

The Tukey test is a single-step multiple comparison procedure and statistical test. It is a post-hoc analysis, which means that it is used in conjunction with an ANOVA.
It finds the means of a factor that are significantly different from each other, comparing all possible pairs of means with a t-test-like method. (Read more for the exact procedure)
In R, the TukeyHSD() function (from the stats package) runs the Tukey test, and its plot() method draws a chart that shows the mean difference for each pair of groups. The multcompView package is loaded here for its multcompLetters() function, used later to label the groups.
# library
library(multcompView)

# Create data
set.seed(1)
treatment <- rep(c("A", "B", "C", "D", "E"), each=20)
value=c( sample(2:5, 20 , replace=T) , sample(6:10, 20 , replace=T), sample(1:7, 20 , replace=T), sample(3:10,
20 , replace=T) , sample(10:20, 20 , replace=T) )
# treatment must be a factor for TukeyHSD() and nlevels()
data=data.frame(treatment=factor(treatment), value)

# What is the effect of the treatment on the value ?


model=lm( data$value ~ data$treatment )
ANOVA=aov(model)

# Tukey test to study each pair of treatment :


TUKEY <- TukeyHSD(x=ANOVA, 'data$treatment', conf.level=0.95)

# Tukey test representation :


plot(TUKEY , las=1 , col="brown")
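
To look at the structure of a TukeyHSD() result without random data, the sketch below runs the same steps on warpbreaks, a dataset shipped with R, using its 3-level tension factor. Choosing this dataset is an assumption made for the illustration; it is not part of the example above.

```r
# One-way ANOVA on a built-in dataset: number of breaks by tension level (L, M, H)
fit <- aov(breaks ~ tension, data = warpbreaks)

# Tukey test on the 'tension' factor
tk <- TukeyHSD(fit, "tension", conf.level = 0.95)

# One row per pair of levels; columns are diff, lwr, upr and "p adj"
print(tk$tension)

# The 4th column holds the adjusted p-values
p_adj <- tk$tension[, "p adj"]
print(round(p_adj, 4))
```

Each row name is a pair of levels separated by a dash, and the adjusted p-values in the 4th column are what a letter-based grouping of the levels is built from.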
Tukey test result on top of boxplot

The previous chart showed no significant difference between groups A and C, and between D and B.
It is possible to represent this information in a boxplot. Groups A and C are represented in the same way: same color, and same ‘b’ letter on top. And so on for B-D and for E.
# Group together the treatments that are not significantly different from each other.
generate_label_df <- function(TUKEY, variable){

# Extract labels and factor levels from Tukey post-hoc


Tukey.levels <- TUKEY[[variable]][,4]
Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])

#I need to put the labels in the same order as in the boxplot :


Tukey.labels$treatment=rownames(Tukey.labels)
Tukey.labels=Tukey.labels[order(Tukey.labels$treatment) , ]
return(Tukey.labels)
}

# Apply the function on my dataset


LABELS <- generate_label_df(TUKEY , "data$treatment")

# A panel of colors to draw each group with the same color :


my_colors <- c(
rgb(143,199,74,maxColorValue = 255),
rgb(242,104,34,maxColorValue = 255),
rgb(111,145,202,maxColorValue = 255)
)

# Draw the basic boxplot


# Letters are stored as text, so convert them to integer codes to index the colors
group_id <- as.numeric(factor(LABELS[,1]))
a <- boxplot(data$value ~ data$treatment , ylim=c(min(data$value) , 1.1*max(data$value)) ,
col=my_colors[group_id] , ylab="value" , main="")

# I want to write the letter over each box. Over is how high I want to write it.
over <- 0.1*max( a$stats[nrow(a$stats),] )

# Add the labels (one x position per box)
text( seq_len(ncol(a$stats)) , a$stats[nrow(a$stats),]+over , LABELS[,1] ,
col=my_colors[group_id] )

Note: Tukey test is also called: Tukey’s range test / Tukey method / Tukey’s honest significance test / Tukey’s HSD
(honest significant difference) test / Tukey-Kramer method
Control box type with the bty option

The bty option of the par() function lets you customize the box around the plot.
Several values are possible. Each is given in lower case, and the shape of the corresponding upper-case letter represents the boundaries:
• "o": complete box (default),
• "n": no box
• "7": top + right
• "l": bottom + left
• "c": top + left + bottom
• "u": left + bottom + right

# Cut the screen in 4 parts


par(mfrow=c(2,2))

#Create data
a=seq(1,29)+4*runif(29,0.4)
b=seq(1,29)^2+runif(29,0.98)

# First graph
par(bty="l")
boxplot(a , col="#69b3a2" , xlab="bottom & left box")
# Second
par(bty="o")
boxplot(b , col="#69b3a2" , xlab="complete box", horizontal=TRUE)
# Third
par(bty="c")
boxplot(a , col="#69b3a2" , xlab="up & bottom & left box", width=0.5)
# Fourth
par(bty="n")
boxplot(a , col="#69b3a2" , xlab="no box")
Split base R plot window with layout()

The layout() function of R lets you split the plot window into areas with custom sizes. Here are a few examples illustrating how to use it, with reproducible code and explanations.

2 rows

layout() divides the device up into as many rows and columns as there are in the matrix mat.
Here I create the matrix with matrix(c(1,2), ncol=1) -> 1 column, 2 rows. This is what I get in the chart!
Note: this could be done using par(mfrow=c(2,1)) as well. But this option does not allow the customization we’ll see further in this post.

# Dummy data
a <- seq(129,1)+4*runif(129,0.4)
b <- seq(1,129)^2+runif(129,0.98)

# Create the layout


nf <- layout( matrix(c(1,2), ncol=1) )

# Fill with plots


hist(a , breaks=30 , border=F , col=rgb(0.1,0.8,0.3,0.5) , xlab="distribution of a" , main="")
boxplot(a , xlab="a" , col=rgb(0.8,0.8,0.3,0.5) , las=2)
2 columns

Here I create the matrix with matrix(c(1,2), ncol=2) -> 2 columns, 1 row. This is what I get in the chart!
Note: if you swap to c(2,1), the second chart will be on the left and the first on the right.
# Dummy data
a <- seq(129,1)+4*runif(129,0.4)
b <- seq(1,129)^2+runif(129,0.98)

# Create the layout


nf <- layout( matrix(c(1,2), ncol=2) )

# Fill with plots


hist(a , breaks=30 , border=F , col=rgb(0.1,0.8,0.3,0.5) , xlab="distribution of a" , main="")
boxplot(a , xlab="a" , col=rgb(0.8,0.8,0.3,0.5) , las=2)
Subdivide second row

matrix(c(1,1,2,3), nrow=2, byrow=TRUE) creates a matrix of 2 rows and 2 columns, filled row by row. The 2 cells of the first row are for the first chart, the third cell is for chart 2 and the last one for chart 3.
# Dummy data
a <- seq(129,1)+4*runif(129,0.4)
b <- seq(1,129)^2+runif(129,0.98)

# Create the layout


nf <- layout( matrix(c(1,1,2,3), nrow=2, byrow=TRUE) )

# Fill with plots


hist(a , breaks=30 , border=F , col=rgb(0.1,0.8,0.3,0.5) , xlab="distribution of a" , main="")
boxplot(a , xlab="a" , col=rgb(0.8,0.8,0.3,0.5) , las=2)
boxplot(b , xlab="b" , col=rgb(0.4,0.2,0.3,0.5) , las=2)
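
The byrow argument matters here. The sketch below prints the layout matrix with and without it, so the cells assigned to each chart can be read directly. No plot is drawn; the variable names are made up for the illustration.

```r
# Filled row by row: chart 1 spans the whole first row, charts 2 and 3 share the second row
by_row <- matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE)
print(by_row)

# Default (filled column by column): chart 1 spans the whole first column instead
by_col <- matrix(c(1, 1, 2, 3), nrow = 2)
print(by_col)
```

Passing by_col to layout() would put the first chart in a tall left column, with charts 2 and 3 stacked on the right.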
Custom proportions

You can customize column and row proportions with widths and heights.
Here, widths=c(3,1) means the first column takes three quarters of the plot window width and the second takes one quarter. heights=c(2,2) gives both rows the same height.
# Dummy data
a <- seq(129,1)+4*runif(129,0.4)
b <- seq(1,129)^2+runif(129,0.98)

# Set the layout


nf <- layout(
matrix(c(1,1,2,3), ncol=2, byrow=TRUE),
widths=c(3,1),
heights=c(2,2)
)

#Add the plots


hist(a , breaks=30 , border=F , col=rgb(0.1,0.8,0.3,0.5) , xlab="distribution of a" , main="")
boxplot(a , xlab="a" , col=rgb(0.8,0.8,0.3,0.5) , las=2)
boxplot(b , xlab="b" , col=rgb(0.4,0.2,0.3,0.5) , las=2)
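
Since widths and heights are relative values, the share of the window taken by each column or row is its value divided by the total. A minimal sketch of that arithmetic, with made-up variable names:

```r
widths  <- c(3, 1)
heights <- c(2, 2)

# Share of the window width taken by each column
width_share <- widths / sum(widths)
print(width_share)

# Share of the window height taken by each row
height_share <- heights / sum(heights)
print(height_share)
```

To preview the resulting areas on screen, layout.show(nf) can be called right after layout().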
