Rfactors
Gaston Sanchez
gastonsanchez.com
About this ebook
Abstract
This ebook aims to show you basic tools for handling categorical data with
R factors.
Citation
Sanchez, G. (2018) A booklet of R factors
URL http://www.gastonsanchez.com/rfactors.pdf
Source
Github Repository:
https://github.com/gastonstat/rfactors
License
Creative Commons Attribution-ShareAlike 4.0 International:
http://creativecommons.org/licenses/by-sa/4.0/
Revision
November 17, 2018
Version 1.0
Contents
1 Categorical Data in R
1.1 Creating Factors
1.1.1 How R treats factors?
1.2 Factors and Data Tables
1.2.1 What is the advantage of R factors?
1.2.2 When to factor?
Chapter 1
Categorical Data in R
I’m one of those with the humble opinion that great software for data science
and analytics should have a data structure dedicated to handling categorical
data. Lucky for us, R is one of the greatest. In case you’re not aware, one of
the nicest features of R is that it provides a data structure exclusively
designed to handle categorical data: factors.
The term “factor”, as used in R for handling categorical variables, comes
from the terminology used in Analysis of Variance, commonly referred to
as ANOVA. In this statistical method, a categorical variable is commonly
referred to as a factor and its categories are known as levels. Perhaps this is
not the best terminology but it is the one R uses, which reflects its distinctive
statistical origins. Especially for those users without a background in
statistics, this is one of R’s idiosyncrasies that seems disconcerting at the
beginning. But as long as you keep in mind that a factor is just the object
that allows you to handle a qualitative variable, you’ll be fine. In case you
need it, here’s a short mantra to remember: “factors have levels”.
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)
# creating a factor from num_vector
first_factor <- factor(num_vector)
first_factor
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3
As you can tell from the previous code snippet, factor() converts the
numeric vector num_vector into a factor (i.e. a categorical variable) with 3
categories, the so-called levels.
You can also obtain a factor from a string vector:
# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')
str_vector
## [1] "a" "b" "c" "b" "c" "a" "c" "b"
# creating a factor from str_vector
second_factor <- factor(str_vector)
second_factor
## [1] a b c b c a c b
## Levels: a b c
Notice how str_vector and second_factor are displayed. Even though the
elements are the same in both the vector and the factor, they are printed in
different formats. The letters in the string vector are displayed with quotes,
while the letters in the factor are printed without quotes.
And of course, you can use a logical vector to generate a factor as well:
# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
# creating a factor from log_vector
third_factor <- factor(log_vector)
third_factor
## [1] TRUE FALSE TRUE TRUE FALSE
## Levels: FALSE TRUE
This means that we can manipulate factors just like we manipulate vectors.
In addition, many functions for vectors can be applied to factors. For
instance, we can use the function length() to get the number of elements
in a factor:
# factors have length
length(first_factor)
## [1] 7
# second to fourth elements
first_factor[2:4]
## [1] 2 3 1
## Levels: 1 2 3
# last element
first_factor[length(first_factor)]
## [1] 2
## Levels: 1 2 3
# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
## [1] 1 3 2 2
## Levels: 1 2 3
If you have a factor with named elements, you can also specify the names
of the elements within the brackets:
names(first_factor) <- letters[1:length(first_factor)]
first_factor
## a b c d e f g
## 1 2 3 1 2 3 2
## Levels: 1 2 3
first_factor[c('b', 'd', 'f')]
## b d f
## 2 1 3
## Levels: 1 2 3
However, you should know that factors are NOT really vectors. To see this
you can check the behavior of the functions is.factor() and is.vector()
on a factor:
# factors are not vectors
is.vector(first_factor)
## [1] FALSE
# factors are factors
is.factor(first_factor)
## [1] TRUE
Another feature that makes factors so special is that their values (the levels)
are mapped to a set of character values for display purposes. This might
seem like a minor feature but it has two important consequences. On the
one hand, this implies that factors provide a way to store character values
very efficiently. Why? Because each unique character value is stored only
once, and the data itself is stored as a vector of integers.
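A minimal sketch of this storage scheme, using a throwaway factor f built from the same values as num_vector (the name f is only for illustration):

```r
# a small factor built from numeric values
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# the data itself is stored as integer codes
typeof(f)
## [1] "integer"

# each unique value is stored only once, as a character level
levels(f)
## [1] "1" "2" "3"

# the integer codes point to positions in levels(f)
as.integer(f)
## [1] 1 2 3 1 2 3 2
```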
Notice how the numeric value 1 was mapped into the character value "1".
And the same happens for the other values 2 and 3 that are mapped into
the characters "2" and "3".
Usually we get data in some file. Rarely do we have to input data
manually. The files typically come in some text format, with extensions
such as csv, tsv, xml, or json. The most frequent case is to either read
the data as raw text, or read it in some type of tabular format
(spreadsheet-like table).
For better or worse, when reading tabular data in R, the default behavior of
Note that the size of the factor iris_factor is less than the character
# factor
cols <- factor(colrs)
If we compare the size of the objects, you’ll notice that the character
vector occupies much less memory than the factor:
# comparing sizes
object.size(colrs)
## 232 bytes
object.size(cols)
## 608 bytes
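The picture changes as the data grows. The sketch below repeats three made-up color names many times (big_colors and its contents are assumptions for illustration) and compares sizes again. On a typical 64-bit build of R the factor should come out smaller, since each element is stored as a 4-byte integer code rather than an 8-byte pointer to a string; the exact byte counts depend on the platform, so none are shown.

```r
# hypothetical long character vector with only three distinct values
big_colors <- rep(c("blue", "red", "green"), times = 100000)
big_factor <- factor(big_colors)

# sizes in bytes, converted to plain numbers for comparison
size_chr <- as.numeric(object.size(big_colors))
size_fac <- as.numeric(object.size(big_factor))

# TRUE when the factor takes less memory than the character vector
size_fac < size_chr
```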
Every time I teach about factors, there is inevitably one student who asks a
very pertinent question: Why do we want to use factors? Isn’t it redundant
to have a factor object when there are already character or integer vectors?
I have two answers to this question.
The first has to do with the storage of factors. Storing a factor as
integers will usually be more efficient than storing a character vector. As
we’ve seen, this is an important issue especially when factors are of
considerable size.
The second reason has to do with ordinal variables. Qualitative data can be
classified into nominal and ordinal variables. Nominal variables could be
easily handled with character vectors. In fact, nominal means name (values
are just names or labels), and there’s no natural order among the
categories. A different case is when we have ordinal variables, like sizes
"small", "medium", "large" or college years "freshman", "sophomore",
"junior", "senior". In these cases we are still using names of categories,
but they can be arranged in increasing or decreasing order. In other words,
we can rank the categories since they have a natural order: small is less
than medium which is less than large. Likewise, freshman comes first, then
sophomore, followed by junior, and finally senior.
So here’s an important question: How do we keep the order of categories
in an ordinal variable? We can use a character vector to store the values.
But a character vector does not allow us to store the ranking of categories.
The solution in R comes via factors. We can use factors to define ordinal
variables, like the following example:
sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
                levels = c('sm', 'md', 'lg'),
                ordered = TRUE)
sizes
## [1] sm md lg sm md
## Levels: sm < md < lg
We’ll talk in more detail about ordinal factors in the next chapter. For
now, just keep in mind the sizes example. As you can tell, sizes has
ordered levels, clearly identifying the first category "sm", the second one
"md", and the third one "lg".
or names of individuals.
There is also another aspect related to the question of whether to convert
strings to factors or not. In practice, most of the time we’ll be working
with some data set. The data we’ll be provided with is what I call the raw
data. Pretty much all real-life data analysis projects require the analyst to
process, clean, and transform the raw data into a clean version, also
called tidy data. Since the processing-cleaning phase involves manipulating
text, formatting, grouping, changing scales, splitting strings, and a wide
variety of other operations, it is better to import the raw data and leave
strings as characters. Once the clean data set has been created, we can work
with this version and convert some of the strings into factors.
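This workflow can be sketched as follows. The tiny inline table and its column names are made up for illustration. read.csv() is called with stringsAsFactors = FALSE so that strings stay as characters during cleaning, and only afterwards is a column converted to a factor.

```r
# hypothetical raw data, read from an inline string instead of a file
raw <- read.csv(
  text = "name,size\nAnna, sm\nBob,LG\nCy,md",
  stringsAsFactors = FALSE
)

# cleaning is easier while the column is still a character vector
raw$size <- tolower(trimws(raw$size))

# once the data is clean, convert the column to an ordinal factor
raw$size <- factor(raw$size, levels = c("sm", "md", "lg"), ordered = TRUE)
raw$size
## [1] sm lg md
## Levels: sm < md < lg
```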
first_factor
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3
Although the created factor only has values between 1 and 3, the levels
range from 1 to 5. This can be useful if we plan to add elements whose
values are not in the input vector num_vector. For instance, you can append
two more elements to one_factor with values 4 and 5 like this:
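A minimal sketch of both steps, assuming the factor is called one_factor and was created from num_vector with levels = 1:5 (this construction is inferred from the outputs shown in this section, not taken verbatim from it):

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# factor whose set of levels is larger than the observed values
one_factor <- factor(num_vector, levels = 1:5)

# appending two elements whose values are valid (so far unused) levels
one_factor[c(8, 9)] <- c(4, 5)
one_factor
## [1] 1 2 3 1 2 3 2 4 5
## Levels: 1 2 3 4 5
```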
If you attempt to insert an element having a value that is not in the
predefined set of levels, R will insert a missing value (<NA>) instead, and
you’ll get a warning message like the one below:
# attempting to add value 6 (not in levels)
one_factor[1] <- 6
## Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level,
## NA generated
one_factor
## [1] <NA> 2 3 1 2 3 2 4 5
## Levels: 1 2 3 4 5
num_word_vector
## [1] one two three one two three two
## Levels: one two three
Argument exclude. If you want to ignore some values of the input vector
x, you can use the exclude argument. You just need to provide those values
which will be removed from the set of levels.
# excluding level 3
factor(num_vector, exclude = 3)
## [1] 1 2 <NA> 1 2 <NA> 2
## Levels: 1 2
The side effect of exclude is that it returns a missing value (<NA>) for each
element that was excluded, which is not always what we want. Here’s one
way to remove the missing values when excluding 3:
# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)
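To actually drop the missing values, one option (given here as a sketch) is to keep only the non-missing elements by subsetting with !is.na():

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# excluding level 3 turns every 3 into a missing value
num_fac12 <- factor(num_vector, exclude = 3)

# keep only the non-missing elements
num_fac12[!is.na(num_fac12)]
## [1] 1 2 1 2 2
## Levels: 1 2
```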
We’ve seen how to work with the argument levels inside the function
factor(). But that’s not the only way in which you can manipulate the
levels of a factor. Closely related to factor() there are two other important
sibling functions: levels() and nlevels().
The function levels() lets you have access to the levels attribute of
a factor. This means that you can use levels() for both: getting the
categories, and setting the categories.
Getting levels. To get the different values for the categories in a factor
you just need to apply levels() on a factor:
# levels()
levels(first_factor)
## [1] "1" "2" "3"
levels(third_factor)
## [1] "FALSE" "TRUE"
Setting levels. If what you want is to specify the levels attribute, you
must use the function levels() followed by the assignment operator <-.
Suppose that we want to change the levels of first_factor and express
them in roman numerals. You can achieve this with:
# copy of first factor
first_factor_copy <- first_factor
first_factor_copy
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3
# setting new levels
levels(first_factor_copy) <- c("I", "II", "III")
first_factor_copy
## [1] I II III I II III II
## Levels: I II III
Don’t confuse length() with nlevels(). The former returns the number
of elements in a factor, while the latter returns the number of levels.
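A quick sketch of the difference, using a throwaway factor f (a name chosen only for this illustration):

```r
# a factor with 7 elements but only 3 levels
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# number of elements
length(f)
## [1] 7

# number of levels, the same as length(levels(f))
nlevels(f)
## [1] 3
```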
For example, say we want to combine categories I and III into a new level
I+III. Here’s how to do it:
# merging levels I and III into I+III
levels(first_factor) <- c("I+III", "II", "I+III")
first_factor
## [1] I+III II I+III I+III II I+III II
## Levels: I+III II
Note that the length of the vector specifying the merged categories will be
the same as the number of levels.
Factors with missing values. Missing values are ubiquitous and they
can appear in any data set. This means that we can have categorical
variables with missing values. For instance, let’s say we have a vector
drink_type with the type of drink ordered by a group of 7 individuals:
# vector of drinks
drink_type <- c('water', 'water', 'beer', 'wine', 'soda', 'water', NA)
drink_type
## [1] "water" "water" "beer" "wine" "soda" "water" NA
As you can tell from the vector drink_type, there is a missing value for the
last element. Now let’s convert this vector into a factor:
# drinks factor
drinks <- factor(drink_type)
drinks
## [1] water water beer wine soda water <NA>
## Levels: beer soda water wine
drinks
Now that there is a level dedicated to missing values, you can assign a string
"NA" to the actual missing values:
drinks[is.na(drinks)] <- "NA"
drinks
## [1] water water beer wine soda water NA
## Levels: beer soda water wine NA
Notice that drinks now has a level for NA. Notice also that this label is no
longer displayed as <NA>. In other words, this is not an R missing value
anymore. It is just another category or level:
is.na(drinks)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
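The whole sequence can be sketched end to end. The step that adds the extra level is an assumption here: it appends the string "NA" to the existing levels with levels<-, which is one way to arrive at the output shown above.

```r
drink_type <- c('water', 'water', 'beer', 'wine', 'soda', 'water', NA)
drinks <- factor(drink_type)

# assumed step: append a regular level whose label is the string "NA"
levels(drinks) <- c(levels(drinks), "NA")

# replace the actual missing value with the new "NA" category
drinks[is.na(drinks)] <- "NA"

levels(drinks)
## [1] "beer"  "soda"  "water" "wine"  "NA"

# no element is a true missing value anymore
any(is.na(drinks))
## [1] FALSE
```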
If you want to specify an ordinal factor you must use the ordered argument
of factor(). This is how you can generate an ordinal factor from
num_vector:
# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)
ordinal_num
## [1] 1 2 3 1 2 3 2
## Levels: 1 < 2 < 3
As you can tell from the snippet above, the levels of an ordinal factor are
displayed with less-than symbols '<', which means that the levels have an
increasing order. We can also get an ordinal factor from our string vector:
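A sketch of that step, assuming the result is stored under the name ordinal_str:

```r
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')

# ordinal factor from a string vector (levels sorted alphabetically)
ordinal_str <- factor(str_vector, ordered = TRUE)
ordinal_str
## [1] a b c b c a c b
## Levels: a < b < c
```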
In fact, when you set ordered = TRUE without specifying the levels, R sorts
the provided values in alphanumeric order. If you have the following
alphanumeric vector ("a1", "1a", "1b", "b1"), what do you think will be the
generated ordered factor? Let’s check the answer:
# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")
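Building the ordered factor gives the answer. In the usual locales digits sort before letters, so the values starting with a digit come first; the exact collation can depend on your locale. The name alphanum_fac is just for this sketch.

```r
alphanum <- c("a1", "1a", "1b", "b1")

# levels are taken from the sorted unique values
alphanum_fac <- factor(alphanum, ordered = TRUE)
levels(alphanum_fac)
## [1] "1a" "1b" "a1" "b1"
```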
Of course, you won’t always be using the default order provided by the
functions factor(..., ordered = TRUE) or ordered(). Sometimes you
want to determine categories according to a different order.
For example, let’s take the values of str_vector and let’s assume that we
want them in descending order, that is, c < b < a. How can you do that?
Easy, you just need to specify the levels in the order you want them and
set ordered = TRUE (or use ordered()):
# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
## [1] a b c b c a c b
## Levels: c < b < a
# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
## [1] a b c b c a c b
## Levels: c < b < a
Notice that when you create an ordinal factor, the given levels will always
be considered in an increasing order. This means that the first value of
levels will be the smallest one, then the second one, and so on. The last
category, in turn, is taken as the one at the top of the scale.
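One practical consequence of this increasing order is that comparison operators and functions such as max() work on ordinal factors. A short sketch, reusing the sizes example from the previous chapter:

```r
sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
                levels = c('sm', 'md', 'lg'),
                ordered = TRUE)

# elements strictly below "lg" on the scale
sizes < "lg"
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

# the largest category present
max(sizes)
## [1] lg
## Levels: sm < md < lg
```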
Now that we have several nominal and ordinal factors, we can compare the
behavior of is.ordered() on two factors:
# is.ordered() on an ordinal factor
ordinal_str
## [1] a b c b c a c b
## Levels: a < b < c
is.ordered(ordinal_str)
## [1] TRUE
# is.ordered() on a nominal factor
second_factor
## [1] a b c b c a c b
## Levels: a b c
is.ordered(second_factor)
## [1] FALSE
We’ve mentioned that factors are stored as vectors of integers (for efficiency
reasons). But we also said that factors are more than vectors. Even though
a factor is displayed with string labels, the way it is stored internally is as
integers. Why is this important to know? Because there will be occasions
in which you’ll need to know exactly which number is associated with each
level.
Imagine you have a factor with levels 11, 22, 33, 44.
# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))
xfactor
## [1] 22 11 44 33 11 22 44
## Levels: 11 22 33 44
To obtain the integer vector associated to xfactor you can use the function
unclass():
# unclassing a factor
unclass(xfactor)
## [1] 2 1 4 3 1 2 4
## attr(,"levels")
## [1] "11" "22" "33" "44"
As you can see, the levels ("11" "22" "33" "44") were mapped to the
vector of integers (1 2 3 4).
An alternative option is to simply apply as.numeric() or as.integer()
instead of using unclass():
# same integer codes as unclass(), without the levels attribute
as.integer(xfactor)
## [1] 2 1 4 3 1 2 4
# same codes, returned as doubles
as.numeric(xfactor)
## [1] 2 1 4 3 1 2 4
Although rarely needed, there can be some cases in which you have to
convert the integer codes back into the original values. This is only
possible when the levels of the factor are themselves numeric. To
accomplish this, use the following command:
# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
## [1] 22 11 44 33 11 22 44
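An equivalent and perhaps more readable route, given here as a sketch, goes through as.character() first. It also makes a common pitfall visible: calling as.numeric() directly on the factor returns the integer codes, not the original values.

```r
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

# convert the labels to character, then to numbers
as.numeric(as.character(xfactor))
## [1] 22 11 44 33 11 22 44

# pitfall: this returns the internal codes instead
as.numeric(xfactor)
## [1] 2 1 4 3 1 2 4
```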
Sometimes it is not enough to create an ordinal variable or to set the
order of the categories. Occasionally, you would like to reorder a factor.
For this purpose you can use the function reorder():
vowels <- c('a', 'b', 'c', 'd', 'e')
set.seed(975)
vowels_fac <- factor(sample(vowels, size = 20, replace = TRUE))
#reorder(vowels_fac, count, median)
# reordering a factor
bymedian <- with(InsectSprays, reorder(spray, count, median))
## Levels: 1 2 3 4 5
# reorder levels of unordered factor
relevel(one_factor, ref = 5)
## [1] <NA> 2 3 1 2 3 2 4 5
## Levels: 5 1 2 3 4
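To see what reorder() does in the InsectSprays call, compare the levels before and after. This sketch uses the InsectSprays data frame that ships with R; the levels of spray end up sorted by the median of count within each spray.

```r
# original levels, in alphabetical order
levels(InsectSprays$spray)
## [1] "A" "B" "C" "D" "E" "F"

# levels reordered by the median count of each spray
bymedian <- with(InsectSprays, reorder(spray, count, median))
levels(bymedian)
## [1] "C" "E" "D" "A" "F" "B"
```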
“There are two tasks that are often performed on factors. One is to drop
unused levels; this can be achieved by a call to factor() since factor(y)
will drop any unused levels from y if y is a factor.”
“The second task is to coarsen the levels of a factor, that is, group two or
more of them together into a single new level.”
y <- sample(letters[1:5], 20, replace = TRUE)
v <- as.factor(y)
xx <- list(I = c("a", "e"), II = c("b", "c", "d"))
levels(v) <- xx
v
## [1] II I I I I I II II II II I II I I II II I I I I
## Levels: I II
A common data manipulation task is: how to get a categorical variable from
a quantitative variable? In other words, how to discretize or categorize a
quantitative variable?
For this kind of common task R provides the handy function cut(). The
idea is to cut values of a numeric input vector into intervals, which in turn
will be the levels of the generated factor. The usage of cut() is:
cut(x, breaks, labels = NULL, include.lowest = FALSE,
right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)
with the following arguments:
• x: a numeric vector which is to be converted to a factor by cutting.
• breaks: either a numeric vector of cut points, or a single number giving
  the number of intervals into which x is to be cut.
• labels: labels for the levels of the resulting category.
To convert income into a factor we use cut(). The first argument is the
input vector (income in this case). The argument breaks is used to indicate
the number of categories or levels of the output factor (e.g. 10):
# cutting a quantitative variable
income_level <- cut(x = income, breaks = 10)
levels(income_level)
## [1] "(99.7,140]" "(140,180]" "(180,220]" "(220,260]" "(260,300]"
## [6] "(300,340]" "(340,380]" "(380,420]" "(420,460]" "(460,500]"
As you can tell, income_level has 10 levels, each level formed by an
interval. Moreover, the intervals are all of the same form: a range of values
with the lower bound surrounded by a parenthesis, and the upper bound
surrounded by a bracket.
You can inspect the produced factor income_level and check the frequencies
with table():
table(income_level)
## income_level
## (99.7,140] (140,180] (180,220] (220,260] (260,300] (300,340]
## 85 106 113 84 87 97
## (340,380] (380,420] (420,460] (460,500]
## 115 103 98 112
By default, cut() has its argument right set to TRUE. This means that the
intervals are open on the left (and closed on the right), as in the levels
displayed above.
To change the default way in which intervals are open and closed you can
set right = FALSE. This option produces intervals closed on the left and
open on the right:
# using other cutting break points
income_b <- cut(x = income, breaks = income_breaks, right = FALSE)
table(income_b)
## income_b
## [100,150) [150,200) [200,250) [250,300) [300,350) [350,400) [400,450)
## 111 139 122 103 124 135 131
## [450,500)
## 135
sum(table(income_b))
## [1] 1000
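The effect of right is easiest to see on values that sit exactly on a break point. In this sketch the vector x and the breaks brks are made up for illustration; note how a value on the outer boundary becomes <NA> on the open side (the argument include.lowest = TRUE can be used to keep it).

```r
# hypothetical values, some of them exactly on a break point
x <- c(100, 120, 150, 180, 200)
brks <- c(100, 150, 200)

# default: intervals of the form (a,b]; the value 100 falls outside
right_closed <- cut(x, breaks = brks)
right_closed
## [1] <NA>      (100,150] (100,150] (150,200] (150,200]
## Levels: (100,150] (150,200]

# right = FALSE: intervals of the form [a,b); the value 200 falls outside
left_closed <- cut(x, breaks = brks, right = FALSE)
left_closed
## [1] [100,150) [100,150) [150,200) [150,200) <NA>
## Levels: [100,150) [150,200)
```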
You can change the labels of the levels using the argument labels. For
example, let’s say we want to name the resulting levels with letters. The
first level [100,150) will be changed to "a", the second level [150,200) will
be changed to "b", and so on.
income_c <- cut(x = income, breaks = income_breaks,
labels = letters[1:(length(income_breaks)-1)])
table(income_c)
## income_c
## a b c d e f g h
## 111 139 122 103 124 135 131 135
The main inputs of gl() are n and k, that is, the number of levels and the
number of replications of each level. Especially for working with data under
the approach of Design of Experiments (DoE), gl() can be very useful.
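A basic sketch with only these two inputs (the name num_gl is just for illustration): three levels, each replicated twice.

```r
# factor with n = 3 levels, each one replicated k = 2 times
num_gl <- gl(n = 3, k = 2)
num_gl
## [1] 1 1 2 2 3 3
## Levels: 1 2 3
```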
Here’s another example setting the arguments labels and length:
# another factor with gl()
girl_boy <- gl(2, 4, labels = c("girl", "boy"), length = 7)
girl_boy
## [1] girl girl girl girl boy boy boy
## Levels: girl boy
By default, the total number of elements is 8 (n = 2 × k = 4): four girls
and four boys. But since we set the argument length = 7, we only got three
boys.