A booklet of R factors

Gaston Sanchez
gastonsanchez.com
About this ebook

Abstract
This ebook aims to show you basic tools for handling categorical data with
R factors.

About the reader

I am assuming two things about you: 1) You acknowledge the importance
of data exploration and visualization; 2) you have some knowledge of the
statistical software R.

Citation
Sanchez, G. (2018) A booklet of R factors
URL http://www.gastonsanchez.com/rfactors.pdf

Source
Github Repository:
https://github.com/gastonstat/rfactors

License
Creative Commons Attribution-ShareAlike 4.0 Unported:
http://creativecommons.org/licenses/by-sa/4.0/

Revision
November 17, 2018
Version 1.0

Contents

1 Categorical Data in R
1.1 Creating Factors
1.1.1 How R treats factors?
1.2 Factors and Data Tables
1.2.1 What is the advantage of R factors?
1.2.2 When to factor?

2 A closer look at factor()
2.1 Function factor()
2.1.1 Manipulating Categories or Levels
2.1.2 Ordinal factors
2.1.3 Unclassing factors
2.1.4 Reordering factors
2.1.5 Dropping levels

3 More about Factors
3.1 Categorizing a quantitative variable
3.1.1 Factor into indicators
3.1.2 Generating Factors Levels with gl()

Chapter 1

Categorical Data in R

I’m one of those with the humble opinion that great software for data science
and analytics should have a data structure dedicated to handling categorical
data. Lucky for us, R is one of the greatest. In case you’re not aware, one of
the nicest features of R is that it provides a data structure exclusively
designed to handle categorical data: factors.
The term “factor”, as used in R for handling categorical variables, comes
from the terminology used in Analysis of Variance, commonly referred to
as ANOVA. In this statistical method, a categorical variable is commonly
referred to as a factor and its categories are known as levels. Perhaps this is
not the best terminology but it is the one R uses, which reflects its distinctive
statistical origins. Especially for those users without a background in
statistics, this is one of R’s idiosyncrasies that seems disconcerting at the
beginning. But as long as you keep in mind that a factor is just the object
that allows you to handle a qualitative variable you’ll be fine. In case you
need it, here’s a short mantra to remember: “factors have levels”.

1.1 Creating Factors

To create a factor in R you use the function of the same name, factor(), which
takes a vector as input. The vector can be either numeric, character or
logical. Let’s see our first example:


# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector

first_factor <- factor(num_vector)

first_factor
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3

As you can tell from the previous code snippet, factor() converts the
numeric vector num_vector into a factor (i.e. a categorical variable) with 3
categories, the so-called levels.
You can also obtain a factor from a string vector:
# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')

str_vector
## [1] "a" "b" "c" "b" "c" "a" "c" "b"
# creating a factor from str_vector
second_factor <- factor(str_vector)

second_factor
## [1] a b c b c a c b
## Levels: a b c

Notice how str_vector and second_factor are displayed. Even though the
elements are the same in both the vector and the factor, they are printed in
different formats. The letters in the string vector are displayed with quotes,
while the letters in the factor are printed without quotes.
And of course, you can use a logical vector to generate a factor as well:
# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# creating a factor from log_vector

third_factor <- factor(log_vector)

third_factor
## [1] TRUE FALSE TRUE TRUE FALSE
## Levels: FALSE TRUE

CC BY-SA 4.0 Gaston Sanchez A booklet of R factors

1.1.1 How R treats factors?

If you’re curious and check the technical R Language Definition, available
online at https://cran.r-project.org/manuals.html, you’ll find that
R factors are referred to as compound objects. According to the manual:
“Factors are currently implemented using an integer array to specify the
actual levels and a second array of names that are mapped to the integers.”
Essentially, a factor is internally stored using two arrays: one is an integer
array containing the values of categories, the other array is the “levels”
which has the names of categories which are mapped to the integers.
Under the hood, the way R stores factors is as vectors of integer values.
One way to confirm this is by using the function storage.mode():
# storage of factor
storage.mode(first_factor)
## [1] "integer"

This means that we can manipulate factors just like we manipulate vectors.
In addition, many functions for vectors can be applied to factors. For
instance, we can use the function length() to get the number of elements
in a factor:
# factors have length
length(first_factor)
## [1] 7

We can also use the square brackets [ ] to extract or select elements of
a factor. Inside the brackets we specify vectors of indices such as numeric
vectors, logical vectors, and sometimes even character vectors.
# first element
first_factor[1]
## [1] 1
## Levels: 1 2 3
# third element
first_factor[3]
## [1] 3

## Levels: 1 2 3
# second to fourth elements
first_factor[2:4]
## [1] 2 3 1
## Levels: 1 2 3
# last element
first_factor[length(first_factor)]
## [1] 2
## Levels: 1 2 3
# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
## [1] 1 3 2 2
## Levels: 1 2 3

If you have a factor with named elements, you can also specify the names
of the elements within the brackets:
names(first_factor) <- letters[1:length(first_factor)]
first_factor
## a b c d e f g
## 1 2 3 1 2 3 2
## Levels: 1 2 3
first_factor[c('b', 'd', 'f')]
## b d f
## 2 1 3
## Levels: 1 2 3

However, you should know that factors are NOT really vectors. To see this
you can check the behavior of the functions is.factor() and is.vector()
on a factor:
# factors are not vectors
is.vector(first_factor)
## [1] FALSE
# factors are factors
is.factor(first_factor)
## [1] TRUE

Even a single element of a factor is also a factor:

class(first_factor[1])
## [1] "factor"

So what makes a factor different from a vector? Well, it turns out
that factors have an additional attribute that vectors don’t: levels. And
as you can expect, the class of a factor is indeed "factor" (not "vector").
# attributes of a factor
attributes(first_factor)
## $levels
## [1] "1" "2" "3"
##
## $class
## [1] "factor"
##
## $names
## [1] "a" "b" "c" "d" "e" "f" "g"

Another feature that makes factors so special is that their values (the levels)
are mapped to a set of character values for display purposes. This might
seem like a minor feature but it has an important consequence: factors
provide a way to store character values very efficiently. Why? Because
each unique character value is stored only once, and the data itself is
stored as a vector of integers.
Notice how the numeric value 1 was mapped into the character value "1".
And the same happens for the other values 2 and 3 that are mapped into
the characters "2" and "3".
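
To make the mapping concrete, here is a small sketch with a made-up vector of color names (colors_fac is just an illustrative name, not something used elsewhere in this booklet). It shows the unique strings kept in the levels, the integer codes that make up the data, and how indexing one by the other rebuilds the original strings.

```r
# a factor built from a made-up character vector
colors_fac <- factor(c("red", "blue", "red", "green", "blue"))

# each unique string is stored once (sorted alphabetically by default)
lvls <- levels(colors_fac)
lvls

# the data itself is a vector of integer codes pointing into the levels
codes <- as.integer(colors_fac)
codes

# indexing the levels by the codes rebuilds the original strings
rebuilt <- lvls[codes]
identical(rebuilt, as.character(colors_fac))
```

Here "red" is the third level, so every "red" in the data is stored as the integer 3.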

1.2 Factors and Data Tables

Usually we get data in some file; rarely do we have to input data
manually. The files are often in plain text format, with typical
extensions such as csv, tsv, xml, or json. Most of the time we either read
the data as text, or read it in some type of tabular format
(spreadsheet-like table).
For better or worse, in versions of R before 4.0.0 the default behavior of
read.table() and similar functions (e.g. read.csv(), read.fwf(), etc.) when
reading tabular data is to convert characters into factors; since R 4.0.0
the default is stringsAsFactors = FALSE. Unless you are totally comfortable
with the way the data file has been structured, I recommend turning off this
conversion by setting the argument stringsAsFactors = FALSE. In this way
you will have more freedom and flexibility to manipulate data and do string
manipulations. Plus, you can always convert those character variables into
factors or numbers later.
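
As a minimal sketch of this recommendation, the snippet below reads a tiny made-up table (passed inline through the text argument of read.csv(), so no file is needed) with stringsAsFactors = FALSE, and then converts only the column that is really categorical. The column names and values are assumptions for illustration.

```r
# a tiny CSV supplied inline (hypothetical data)
csv_text <- "name,size
anna,sm
bob,lg
carl,md"

# keep strings as plain character vectors when importing
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(dat$size)

# convert only the column that is truly categorical
dat$size <- factor(dat$size, levels = c("sm", "md", "lg"))
class(dat$name)
class(dat$size)
levels(dat$size)
```

The names stay as characters, ready for string manipulation, while size becomes a factor with the levels in the order we chose.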

There is a caveat with stringsAsFactors = FALSE. The reason R
converts characters into factors is memory efficiency. Internally,
a factor is stored as an integer vector with an attribute levels. In general,
this storage form is more efficient than storing a character vector. A better
way to see this is with an example. We’ll use the popular iris dataset which
contains the categorical variable Species:
# iris data set
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Species as factor
class(iris$Species)
## [1] "factor"

Let’s compare the sizes of apparently the same type of data:

# species in two formats
iris_factor <- iris$Species
iris_string <- as.character(iris$Species)

# comparison of memory size

object.size(iris_factor)
## 1192 bytes
object.size(iris_string)
## 1400 bytes

Note that the size of the factor iris_factor is smaller than that of the
character vector iris_string.

For objects with only a few elements, however, character vectors tend to be
smaller than factors. For instance, consider the following example.
Here we have a character vector of color names with 5 elements, and a
factor derived from the vector of colors.
# vector of colors
colrs <- c('blue', 'red', 'blue', 'green', 'red')

# factor
cols <- factor(colrs)

If we compare the size of the objects, you’ll notice that the vector occupies
much less memory than the factor:
# comparing sizes
object.size(colrs)
## 232 bytes
object.size(cols)
## 608 bytes
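
The picture changes as the number of elements grows. Here is a rough sketch with a long made-up vector that has only three distinct values; the names big_string and big_factor are illustrative, and the exact byte counts depend on your R version and platform, so none are shown. On a typical 64-bit installation the factor should come out smaller.

```r
# 300,000 strings drawn from only three distinct values (made-up data)
big_string <- rep(c("setosa", "versicolor", "virginica"), times = 100000)
big_factor <- factor(big_string)

# sizes in bytes (values vary by R version and platform)
size_string <- as.numeric(object.size(big_string))
size_factor <- as.numeric(object.size(big_factor))

# is the factor smaller than the character vector?
size_factor < size_string
```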

1.2.1 What is the advantage of R factors?

Every time I teach about factors, there is inevitably one student who asks a
very pertinent question: Why do we want to use factors? Isn’t it redundant
to have a factor object when there are already character or integer vectors?
I have two answers to this question.
The first has to do with the storage of factors. Storing a factor as in-
tegers will usually be more efficient than storing a character vector. As
we’ve seen, this is an important issue especially when factors are of con-
siderable size. The second reason has to do with ordinal variables. Qual-
itative data can be classified into nominal and ordinal variables. Nominal
variables could be easily handled with character vectors. In fact, nom-
inal means name (values are just names or labels), and there’s no nat-
ural order among the categories. A different case is when we have or-
dinal variables, like sizes "small", "medium", "large" or college years
"freshman", "sophomore", "junior", "senior". In these cases we are
still using names of categories, but they can be arranged in increasing or
decreasing order. In other words, we can rank the categories since they
have a natural order: small is less than medium which is less than large.
Likewise, freshman comes first, then sophomore, followed by junior, and
finally senior.
So here’s an important question: How do we keep the order of categories
in an ordinal variable? We can use a character vector to store the values.
But a character vector does not allow us to store the ranking of categories.
The solution in R comes via factors. We can use factors to define ordinal
variables, like the following example:
sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
levels = c('sm', 'md', 'lg'),
ordered = TRUE)

sizes
## [1] sm md lg sm md
## Levels: sm < md < lg

We’ll talk in more detail about ordinal factors in the next chapter. For
now, just keep in mind the sizes example. As you can tell, sizes has
ordered levels, clearly identifying the first category "sm", the second one
"md", and the third one "lg".
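
Because the ranking is stored with the factor, comparisons become meaningful. The following sketch re-creates sizes so it runs on its own, and tries a few operations that work on ordered factors; the helper names (smaller_than_lg, largest, sorted_sizes) are just illustrative.

```r
sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
                levels = c('sm', 'md', 'lg'),
                ordered = TRUE)

# element-wise comparison against a level: TRUE for 'sm' and 'md'
smaller_than_lg <- sizes < 'lg'
smaller_than_lg

# largest category present in the data
largest <- as.character(max(sizes))
largest

# sorting follows the order of the levels, not the alphabet
sorted_sizes <- as.character(sort(sizes))
sorted_sizes
```

With a plain character vector, sort() would put "lg" first simply because of alphabetical order.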

1.2.2 When to factor?

As mentioned above, all the table-reading functions (e.g. read.table(),
read.csv(), read.delim(), read.fwf(), etc.) import data tables as objects
of class data.frame. In turn, the creation of data frames (e.g. via
data.frame()) converts character strings into factors by default in
versions of R before 4.0.0. We’ve talked about the stringsAsFactors
argument, which the function data.frame() also accepts.
This is the argument that allows us to turn on or turn off the conversion
of character strings into factors.
So here’s another question: When do we want characters to become
factors? There is no universal answer to this question. The decision
of whether to convert strings into factors is going to depend on various
aspects. For instance, the purpose of the analysis, or the type of variable
that contains the strings. Sometimes it will make sense to have a variable as
a factor, like gender (e.g. male, female) or ethnicity (e.g. hispanic,
african-american, native-american). In other cases it does not make much
sense to create a factor from a variable containing addresses or telephone
numbers or names of individuals.
There is also another aspect related to the question of whether to convert
strings into factors or not. In practice, most of the time we’ll be working
with some data set. The data we’ll be provided with is what I call the raw
data. Pretty much all real-life data analysis projects require the analyst to
process, clean, and transform the raw data into a clean data version, also
called tidy data. Since the processing-cleaning phase involves manipulating
text, formatting, grouping, changing scales, splitting strings, and a wide
variety of operations, it is better to import the raw data and leave strings
as characters. Once the clean data set has been created, then we can work
with this version and convert some of the strings into factors.
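
Here is a small sketch of that workflow, using made-up raw values with inconsistent capitalization and stray spaces (the names raw_gender, clean_gender and gender are assumptions for illustration). The cleaning is done while the data are still characters, and the conversion to factor comes last.

```r
# raw values with inconsistent case and stray spaces (made-up data)
raw_gender <- c(" Male", "female ", "FEMALE", "male", " Female")

# string cleaning is easier while the data are still characters
clean_gender <- tolower(trimws(raw_gender))

# convert to a factor once the values are clean
gender <- factor(clean_gender, levels = c("female", "male"))
counts <- table(gender)
counts
```

Had we converted the raw strings directly, we would have ended up with five different levels instead of two.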

Chapter 2

A closer look at factor()

Since working with categorical data in R typically involves working with
factors, you should become familiar with the variety of functions related
to them. In the following sections we’ll cover a bunch of topics and
details about factors so you can be better prepared to deal with any type
of categorical data.

2.1 Function factor()

Given the fundamental role played by the function factor() we need to
take a closer look at its arguments. If you check the documentation,
help(factor), you’ll see that the usage of the function factor() is:
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x), nmax = NA)
with the following arguments:
• x: a vector of data
• levels: an optional vector for the categories
• labels: an optional character vector of labels for the levels
• exclude: a vector of values to be excluded when forming the set of
levels
• ordered: logical value to indicate if the levels should be regarded as
ordered
• nmax: an upper bound on the number of levels
The main argument of factor() is the input vector x. The next argument
is levels, followed by labels, both of which are optional arguments. Al-
though you won’t always be providing values for levels and labels, it is
important to understand how R handles these arguments by default.

Argument levels. If levels is not provided (which is what happens in
most cases), then R assigns the sorted unique values in x as the category
levels. For example, consider our numeric vector from the first example:
num_vector contains unique values 1, 2, and 3.
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector

first_factor <- factor(num_vector)

first_factor
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3

Now imagine we want to have levels 1, 2, 3, 4, and 5. This is how you
can define the factor with an extended set of levels:
# numeric vector
num_vector
## [1] 1 2 3 1 2 3 2
# defining levels
one_factor <- factor(num_vector, levels = 1:5)
one_factor
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3 4 5

Although the created factor only has values between 1 and 3, the levels
range from 1 to 5. This can be useful if we plan to add elements whose
values are not in the input vector num_vector. For instance, you can append
two more elements to one_factor with values 4 and 5 like this:

# adding values 4 and 5

one_factor[c(8, 9)] <- c(4, 5)
one_factor
## [1] 1 2 3 1 2 3 2 4 5
## Levels: 1 2 3 4 5

If you attempt to insert an element having a value that is not in the prede-
fined set of levels, R will insert a missing value (<NA>) instead, and you’ll
get a warning message like the one below:
# attempting to add value 6 (not in levels)
one_factor[1] <- 6
## Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level, NA generated
one_factor
## [1] <NA> 2 3 1 2 3 2 4 5
## Levels: 1 2 3 4 5
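
One way around this warning, sketched here under the assumption that you really do want the new category, is to extend the set of levels first and only then assign the value. The factor is re-created so the snippet runs on its own.

```r
one_factor <- factor(c(1, 2, 3, 1, 2, 3, 2), levels = 1:5)

# extend the set of levels first ...
levels(one_factor) <- c(levels(one_factor), "6")

# ... then the assignment no longer produces a missing value
one_factor[1] <- "6"
one_factor
```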

Argument labels. Another very useful argument is labels, which allows
you to provide a string vector for naming the levels in a different
way from the values in x. Let’s take the numeric vector num_vector again,
and say we want to use words as labels instead of numeric values. Here’s
how you can create a factor with predefined labels:
# defining labels
num_word_vector <- factor(num_vector, labels = c("one", "two", "three"))

num_word_vector
## [1] one two three one two three two
## Levels: one two three
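
The arguments levels and labels can also be used together. In this sketch (the survey answers are made up, and answers_fac is just an illustrative name) levels says which values to expect and in what order, while labels says how each of those levels should be displayed; the two vectors are matched by position.

```r
# hypothetical survey answers coded as short strings
answers <- c("y", "n", "n", "y", "y")

# levels: which values to expect, and in what order
# labels: how each of those levels should be displayed
answers_fac <- factor(answers,
                      levels = c("n", "y"),
                      labels = c("no", "yes"))
answers_fac
levels(answers_fac)
```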

Argument exclude. If you want to ignore some values of the input vector
x, you can use the exclude argument. You just need to provide those values
which will be removed from the set of levels.
# excluding level 3
factor(num_vector, exclude = 3)
## [1] 1 2 <NA> 1 2 <NA> 2
## Levels: 1 2

# excluding levels 1 and 3

factor(num_vector, exclude = c(1,3))
## [1] <NA> 2 <NA> <NA> 2 <NA> 2
## Levels: 2

The side effect of exclude is that it returns a missing value (<NA>) for each
element that was excluded, which is not always what we want. Here’s one
way to remove the missing values when excluding 3:
# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)

# oops, we have some missing values

num_fac12
## [1] 1 2 <NA> 1 2 <NA> 2
## Levels: 1 2
# removing missing values
num_fac12[!is.na(num_fac12)]
## [1] 1 2 1 2 2
## Levels: 1 2

2.1.1 Manipulating Categories or Levels

We’ve seen how to work with the argument levels inside the function
factor(). But that’s not the only way in which you can manipulate the
levels of a factor. Closely related to factor() there are two other important
sibling functions: levels() and nlevels().
The function levels() lets you have access to the levels attribute of
a factor. This means that you can use levels() for both: getting the
categories, and setting the categories.

Getting levels. To get the different values for the categories in a factor
you just need to apply levels() on a factor:
# levels()
levels(first_factor)
## [1] "1" "2" "3"
levels(third_factor)

## [1] "FALSE" "TRUE"

Setting levels. If what you want is to specify the levels attribute, you
must use the function levels() followed by the assignment operator <-.
Suppose that we want to change the levels of first_factor and express
them in Roman numerals. You can achieve this with:
# copy of first factor
first_factor_copy <- first_factor
first_factor_copy
## [1] 1 2 3 1 2 3 2
## Levels: 1 2 3
# setting new levels
levels(first_factor_copy) <- c("I", "II", "III")
first_factor_copy
## [1] I II III I II III II
## Levels: I II III

Number of Levels. Besides the function levels() there’s another very
handy function: nlevels(). This function allows you to return the number
of levels of a factor. In other words, nlevels() returns the length of the
attribute levels:
# nlevels()
nlevels(first_factor)
## [1] 3
# equivalent to
length(levels(first_factor))
## [1] 3

Don’t confuse length() with nlevels(). The former returns the number
of elements in a factor, while the latter returns the number of levels.

Merging Levels. Sometimes we may need to “merge” or collapse two or

more different levels into one single level. We can achieve this by using the
function levels() and assigning a new vector of levels containing repeated
values for those categories that we wish to merge.

For example, say we want to combine categories I and III of
first_factor_copy into a new level I+III. Here’s how to do it:
# merging levels I and III
levels(first_factor_copy) <- c("I+III", "II", "I+III")

# factor with merged levels
first_factor_copy
## [1] I+III II I+III I+III II I+III II
## Levels: I+III II

Note that the length of the vector specifying the merged categories must be
the same as the number of original levels.

Factors with missing values. Missing values are ubiquitous and they
can appear in any data set. This means that we can have categorical
variables with missing values. For instance, let’s say we have a vector
drink_type with the type of drink ordered by a group of 7 individuals:
# vector of drinks
drink_type <- c('water', 'water', 'beer', 'wine', 'soda', 'water', NA)
drink_type
## [1] "water" "water" "beer" "wine" "soda" "water" NA

As you can tell from the vector drink_type, there is a missing value for the
last element. Now let’s convert this vector into a factor:
# drinks factor
drinks <- factor(drink_type)
drinks
## [1] water water beer wine soda water <NA>
## Levels: beer soda water wine

Missing values in factors are displayed as <NA> instead of just NA.

What if you want to consider a missing value as another level? To do this,
first you need to add a new level to the factor. For instance, you can take
the current levels and append a string "NA" that will be the level of the
missing values:
# add extra level 'NA'
levels(drinks) <- c(levels(drinks), "NA")

drinks

## [1] water water beer wine soda water <NA>

## Levels: beer soda water wine NA

Now that there is a level dedicated to missing values, you can assign a string
"NA" to the actual missing values:
drinks[is.na(drinks)] <- "NA"

drinks
## [1] water water beer wine soda water NA
## Levels: beer soda water wine NA

Notice that drinks now has a level for NA. Notice also that this label is no
longer displayed as <NA>. In other words, this is not an R missing value
anymore. It is just another category or level:
is.na(drinks)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
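
Base R also offers alternatives to the manual approach above. The sketch below uses addNA(), which appends NA itself (rather than the string "NA") as an extra level, and table() with useNA = "ifany", which counts missing values without changing the factor. The factor is re-created so the snippet runs on its own; drinks_na and counts are illustrative names.

```r
drink_type <- c('water', 'water', 'beer', 'wine', 'soda', 'water', NA)
drinks <- factor(drink_type)

# addNA() appends NA as an extra level
drinks_na <- addNA(drinks)
levels(drinks_na)
nlevels(drinks_na)

# missing values can also be counted without modifying the factor
counts <- table(drinks, useNA = "ifany")
counts
```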

2.1.2 Ordinal factors

By default, factor() creates a nominal categorical variable, not an ordinal
one. One way to check that you have a nominal factor is to use the function
is.ordered(), which returns TRUE only if its argument is an ordinal factor.
# ordinal factor?
is.ordered(first_factor)
## [1] FALSE

If you want to specify an ordinal factor you must use the ordered argument
of factor(). This is how you can generate an ordinal factor from
num_vector:
# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)
ordinal_num
## [1] 1 2 3 1 2 3 2
## Levels: 1 < 2 < 3

As you can tell from the snippet above, the levels of an ordinal factor are
displayed with less-than symbols '<', which means that the levels have an
increasing order. We can also get an ordinal factor from our string vector:

# ordinal factor from character vector

ordinal_str <- factor(str_vector, ordered = TRUE)
ordinal_str
## [1] a b c b c a c b
## Levels: a < b < c

In fact, when you set ordered = TRUE without specifying levels, R sorts the
unique values in alphanumeric order (the same default used for nominal
factors). If you have the following alphanumeric vector ("a1",
"1a", "1b", "b1"), what do you think will be the generated ordered
factor? Let’s check the answer:
# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")

# ordinal factor from character vector

ordinal_alphanum <- factor(alphanum, ordered = TRUE)
ordinal_alphanum
## [1] a1 1a 1b b1
## Levels: 1a < 1b < a1 < b1

An alternative way to specify an ordinal variable is by using the function
ordered(), which is just a convenient wrapper for
factor(x, ..., ordered = TRUE):
# ordinal factor with ordered()
ordered(num_vector)
## [1] 1 2 3 1 2 3 2
## Levels: 1 < 2 < 3
# same as using 'ordered' argument
factor(num_vector, ordered = TRUE)
## [1] 1 2 3 1 2 3 2
## Levels: 1 < 2 < 3

A word of caution. Don’t confuse the function ordered() with order().
They are not equivalent. order() returns the permutation of indices that
arranges a vector into ascending or descending order (sort() is the function
that returns the sorted vector). ordered(), as we’ve seen, is
used to get ordinal factors.
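
A tiny sketch with a made-up numeric vector may help keep the three functions apart:

```r
x <- c(30, 10, 20)

# order() returns positions: the 2nd element is the smallest,
# then the 3rd, then the 1st
idx <- order(x)
idx
## [1] 2 3 1

# indexing with those positions (or calling sort()) gives the sorted values
x[idx]
sort(x)

# ordered() does not rearrange anything; it builds an ordinal factor
x_ord <- ordered(x)
x_ord
```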

Of course, you won’t always be using the default order provided by the
functions factor(..., ordered = TRUE) or ordered(). Sometimes you
want to determine categories according to a different order.

For example, let’s take the values of str_vector and let’s assume that we
want them in descending order, that is, c < b < a. How can you do that?
Easy, you just need to specify the levels in the order you want them and
set ordered = TRUE (or use ordered()):
# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
## [1] a b c b c a c b
## Levels: c < b < a
# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
## [1] a b c b c a c b
## Levels: c < b < a

Here’s another example. Consider a set of size values: "xs" extra-small,
"sm" small, "md" medium, "lg" large, and "xl" extra-large. If you have a
vector with size values you can create an ordinal variable as follows:
# vector of sizes
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")

# setting levels with specified order

ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))
## [1] sm xs xl lg xs lg
## Levels: xs < sm < md < lg < xl

Notice that when you create an ordinal factor, the given levels will always
be considered in an increasing order. This means that the first value of
levels will be the smallest one, then the second one, and so on. The last
category, in turn, is taken as the one at the top of the scale.
Now that we have several nominal and ordinal factors, we can compare the
behavior of is.ordered() on two factors:
# is.ordered() on an ordinal factor
ordinal_str
## [1] a b c b c a c b
## Levels: a < b < c
is.ordered(ordinal_str)
## [1] TRUE
# is.ordered() on a nominal factor

second_factor
## [1] a b c b c a c b
## Levels: a b c
is.ordered(second_factor)
## [1] FALSE

2.1.3 Unclassing factors

We’ve mentioned that factors are stored as vectors of integers (for efficiency
reasons). But we also said that factors are more than vectors. Even though
a factor is displayed with string labels, the way it is stored internally is as
integers. Why is this important to know? Because there will be occasions
in which you’ll need to know exactly what numbers are associated with each
level.
Imagine you have a factor with levels 11, 22, 33, 44.
# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))
xfactor
## [1] 22 11 44 33 11 22 44
## Levels: 11 22 33 44

To obtain the integer vector associated to xfactor you can use the function
unclass():
# unclassing a factor
unclass(xfactor)
## [1] 2 1 4 3 1 2 4
## attr(,"levels")
## [1] "11" "22" "33" "44"

As you can see, the levels ("11" "22" "33" "44") were mapped to the
vector of integers (1 2 3 4).
An alternative option is to simply apply as.numeric() or as.integer()
instead of using unclass():
# equivalent to unclass
as.integer(xfactor)
## [1] 2 1 4 3 1 2 4

# equivalent to unclass
as.numeric(xfactor)
## [1] 2 1 4 3 1 2 4

Although rarely used, there can be some cases in which what you need to
do is revert the integer values in order to get the original factor levels. This
is only possible when the levels of the factor are themselves numeric. To
accomplish this use the following command:
# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
## [1] 22 11 44 33 11 22 44
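
This distinction matters in practice because calling as.numeric() directly on a factor silently returns the codes. Here is a short sketch of the pitfall and of two ways to recover the numbers (via_character and via_levels are illustrative names); xfactor is re-created so the snippet runs on its own.

```r
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

# as.numeric() on the factor returns the integer codes, not 22, 11, ...
codes <- as.numeric(xfactor)
codes

# going through the labels recovers the original numbers
via_character <- as.numeric(as.character(xfactor))
via_levels <- as.numeric(levels(xfactor))[xfactor]

identical(via_character, via_levels)
```

Both routes give the same result; the second one converts only the levels (four values here) instead of every element.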

2.1.4 Reordering factors

Sometimes creating an ordinal variable or setting the order of the
categories is not enough. Occasionally, you would like to reorder the levels
of a factor according to the values of another variable.
For this purpose you can use the function reorder():
# a factor of 20 letters sampled with replacement
vowels <- c('a', 'b', 'c', 'd', 'e')
set.seed(975)
vowels_fac <- factor(sample(vowels, size = 20, replace = TRUE))

# reordering the levels of spray by the median of count
# (InsectSprays is a data set included with R)
bymedian <- with(InsectSprays, reorder(spray, count, median))
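
To see what reorder() does on something small enough to check by hand, here is a sketch with a made-up data frame (scores, group and value are hypothetical names). The group medians are 11 for a, 2 for b, and 6 for c, so the reordered levels should come out as b, c, a.

```r
# made-up scores for three groups
scores <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 12, 1, 3, 5, 7)
)

# default alphabetical levels
levels(scores$group)

# reorder the levels of group by the median of value within each group
by_median <- reorder(scores$group, scores$value, FUN = median)
levels(by_median)
```

Only the order of the levels changes; the elements of the factor stay where they were.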

Reordering levels. Another useful function to reorder the levels of a
factor is relevel(). This function has two main arguments: an unordered
factor x, and a reference level ref. What relevel() does is reorder the
levels of a factor so that the level specified by ref becomes the first level,
while the others are moved down.
For example, consider one_factor, which is an unordered factor with levels
1, 2, 3, 4, 5. We can use relevel() to specify ref = 5 as the reference level.
This will cause 5 to be the first level, while the rest of the levels will be
moved down:
one_factor
## [1] <NA> 2 3 1 2 3 2 4 5

## Levels: 1 2 3 4 5
# reorder levels of unordered factor
relevel(one_factor, ref = 5)
## [1] <NA> 2 3 1 2 3 2 4 5
## Levels: 5 1 2 3 4

2.1.5 Dropping levels

“There are two tasks that are often performed on factors. One is to drop
unused levels; this can be achieved by a call to factor() since factor(y)
will drop any unused levels from y if y is a factor.”
“The second task is to coarsen the levels of a factor, that is, group two or
more of them together into a single new level.”
# coarsening: a and e become level I; b, c and d become level II
y <- sample(letters[1:5], 20, replace = TRUE)
v <- as.factor(y)
xx <- list(I = c("a", "e"), II = c("b", "c", "d"))
levels(v) <- xx
v
## [1] II I I I I I II II II II I II I I II II I I I I
## Levels: I II
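
The first task quoted above, dropping unused levels, can be sketched as follows. The factor letters_fac is made up for illustration; the snippet shows that subsetting keeps all the original levels, and that either factor() or droplevels() removes the ones that are no longer used.

```r
letters_fac <- factor(c("a", "b", "c", "a", "b"))

# subsetting keeps all the original levels, even unused ones
sub_fac <- letters_fac[letters_fac != "c"]
levels(sub_fac)

# calling factor() again drops the unused level
levels(factor(sub_fac))

# droplevels() does the same thing
levels(droplevels(sub_fac))
```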

Chapter 3

More about Factors

So far we’ve seen a comprehensive review of the functions factor() and
levels(). Now we’ll talk about other accessory functions for creating and
handling factors in R.

3.1 Categorizing a quantitative variable

A common data manipulation task is to derive a categorical variable from
a quantitative variable. In other words, how do we discretize or categorize
a quantitative variable?
For this kind of task R provides the handy function cut(). The idea is to
cut the values of a numeric input vector into intervals, which in turn
become the levels of the generated factor. The usage of cut() is:
cut(x, breaks, labels = NULL, include.lowest = FALSE,
right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)
with the following arguments:
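• x a numeric vector which is to be converted to a factor by cutting.
• breaks either a numeric vector of two or more unique cut points, or a
single number (greater than or equal to 2) giving the number of
intervals into which x is to be cut.
• labels labels for the levels of the resulting category. By default, labels
are constructed using "(a,b]" interval notation.
• include.lowest logical indicating if values equal to the lowest ’breaks’
point (or the highest, when right = FALSE) should be included.
• right logical, indicating if the intervals should be closed on the right
(and open on the left) or vice versa.
• dig.lab integer which is used when labels are not given; it determines
the number of digits used in formatting the break numbers.
• ordered_result logical: should the result be an ordered factor?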

Example. The following code creates a numeric vector, income, that holds
some fake values of a hypothetical income variable.
# cutting a quantitative variable
set.seed(321)
income <- round(runif(n = 1000, min = 100, max = 500), 2)

To convert income into a factor we use cut(). The first argument is the
input vector (income in this case). The argument breaks is used to indicate
the number of categories or levels of the output factor (e.g. 10):
# cutting a quantitative variable
income_level <- cut(x = income, breaks = 10)

levels(income_level)
## [1] "(99.7,140]" "(140,180]" "(180,220]" "(220,260]" "(260,300]"
## [6] "(300,340]" "(340,380]" "(380,420]" "(420,460]" "(460,500]"

As you can tell, income_level has 10 levels, each level formed by an
interval. Moreover, the intervals are all of the same form: a range of values
with the lower bound preceded by a parenthesis, and the upper bound
followed by a bracket.
You can inspect the produced factor income_level and check the
frequencies with table():
table(income_level)
## income_level
## (99.7,140] (140,180] (180,220] (220,260] (260,300] (300,340]
## 85 106 113 84 87 97
## (340,380] (380,420] (420,460] (460,500]
## 115 103 98 112
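
You may wonder why the first interval starts at 99.7 rather than at the
minimum of income. According to the documentation of cut(), when breaks is a
single number the range of the data is divided into equal pieces, and the
outer limits are then moved away by 0.1% of the range so that the extreme
values fall inside the intervals. Here is a minimal sketch with a small
hypothetical vector x:

```r
# hypothetical values ranging from 0 to 10
x <- c(0, 2.5, 5, 7.5, 10)

# ask for two intervals of equal width
halves <- cut(x, breaks = 2)
levels(halves)

# both extremes are assigned to an interval (no NA's)
table(halves, useNA = "ifany")

# same grouping with explicit cut points: the outer limits are
# pushed out by 0.1% of the range (10 / 1000 = 0.01)
manual <- cut(x, breaks = c(-0.01, 5, 10.01))
identical(as.integer(halves), as.integer(manual))
```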

By default, cut() has its argument right set to TRUE. This means that the
intervals are open on the left (and closed on the right):

# using other cutting break points

income_breaks <- seq(from = 100, to = 500, by = 50)
income_a <- cut(x = income, breaks = income_breaks)
table(income_a)
## income_a
## (100,150] (150,200] (200,250] (250,300] (300,350] (350,400] (400,450]
## 111 139 122 103 124 135 131
## (450,500]
## 135
sum(table(income_a))
## [1] 1000

To change the default way in which intervals are open and closed you can
set right = FALSE. This option produces intervals closed on the left and
open on the right:
# using other cutting break points
income_b <- cut(x = income, breaks = income_breaks, right = FALSE)
table(income_b)
## income_b
## [100,150) [150,200) [200,250) [250,300) [300,350) [350,400) [400,450)
## 111 139 122 103 124 135 131
## [450,500)
## 135
sum(table(income_b))
## [1] 1000
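
With explicit break points, a value sitting exactly on the open end of the
first (or last) interval is not assigned to any level and becomes NA. The
argument include.lowest takes care of this. The following sketch uses
hypothetical values placed exactly on the break points:

```r
# hypothetical break points and values lying exactly on them
pts <- c(100, 300, 500)
x <- c(100, 300, 500)

# default (right = TRUE): intervals (100,300] (300,500], so 100 is left out
right_closed <- cut(x, breaks = pts)
is.na(right_closed)

# right = FALSE: intervals [100,300) [300,500), so 500 is left out
left_closed <- cut(x, breaks = pts, right = FALSE)
is.na(left_closed)

# include.lowest = TRUE closes the outer interval on both sides
both_ends <- cut(x, breaks = pts, include.lowest = TRUE)
levels(both_ends)
anyNA(both_ends)
```

This did not matter for income_a and income_b because no value of income
falls exactly on 100 or 500 (the counts add up to 1000).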

You can change the labels of the levels using the argument labels. For
example, let’s say we want to name the resulting levels with letters. The
first level (100,150] will be changed to "a", the second level (150,200] will
be changed to "b", and so on.
income_c <- cut(x = income, breaks = income_breaks,
labels = letters[1:(length(income_breaks)-1)])

table(income_c)
## income_c
## a b c d e f g h
## 111 139 122 103 124 135 131 135
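
Labels are often combined with ordered_result = TRUE to get an ordered
factor with meaningful names. The sketch below recreates income as above;
the break points 100, 200, 350, 500 and the names low, medium, high are
arbitrary choices made only for illustration:

```r
# same fake income values as before
set.seed(321)
income <- round(runif(n = 1000, min = 100, max = 500), 2)

# three named categories, returned as an ordered factor
income_d <- cut(x = income, breaks = c(100, 200, 350, 500),
                labels = c("low", "medium", "high"),
                include.lowest = TRUE, ordered_result = TRUE)

is.ordered(income_d)
table(income_d)

# ordered factors can be compared against a level
sum(income_d >= "medium")
```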

3.1.1 Factor into indicators

Another frequent operation is to decompose a categorical variable into
indicators, also known as dummy variables. The idea is to create a table
(rectangular array or matrix) with as many columns as levels. Each
column is a binary variable (0 or 1).
You can think of this as “unfolding” a factor. Other authors call it creating
a disjunctive table. Each row has only one 1, and the rest of the values are
zeros. The sum of values in a column equals the number of elements in
that particular category.
Say you have a factor of temperatures with categories hot and cold. One way
to obtain dummy indicators for each temperature level is to construct a
matrix with as many columns as categories to binarize:
# example
hot_cold = gl(n = 2, k = 3, labels = c('hot', 'cold'))

hot_cold_mat = matrix(0, nrow = length(hot_cold), ncol = nlevels(hot_cold))

hot_cold_mat[hot_cold == 'hot', 1] = 1
hot_cold_mat[hot_cold == 'cold', 2] = 1
dimnames(hot_cold_mat) = list(1:length(hot_cold), c('hot', 'cold'))
hot_cold_mat
## hot cold
## 1 1 0
## 2 1 0
## 3 1 0
## 4 0 1
## 5 0 1
## 6 0 1
# sum of columns equals elements in each category
colSums(hot_cold_mat)
## hot cold
## 3 3
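
Filling one column at a time becomes tedious when a factor has many levels.
A more general sketch, which is an alternative to the approach above and not
part of it, compares every element against every level with outer(). The
last lines show that model.matrix() without an intercept produces the same
0/1 values:

```r
hot_cold <- gl(n = 2, k = 3, labels = c("hot", "cold"))

# compare each element with each level; multiply by 1 to get 0/1 values
indicators <- outer(as.character(hot_cold), levels(hot_cold), FUN = "==") * 1
dimnames(indicators) <- list(seq_along(hot_cold), levels(hot_cold))
indicators

# each row has exactly one 1
rowSums(indicators)

# model.matrix() without intercept (- 1) gives the same indicators,
# although with different column names
design_mat <- model.matrix(~ hot_cold - 1)
all(design_mat == indicators)
```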

3.1.2 Generating Factor Levels with gl()

In addition to the function factor(), there’s a secondary function that
you can use to create factors with a simple structure: gl(). This function
generates factors by specifying a pattern of levels. Here’s its usage:

gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

with the following arguments:
• n an integer giving the number of levels.
• k an integer giving the number of replications.
• length an integer giving the length of the result.
• labels an optional vector of labels for the resulting factor levels.
• ordered logical indicating whether the result should be ordered or
not.
Here’s an example on how to use gl():
# factor with gl()
num_levs = 4
num_reps = 3
simple_factor = gl(num_levs, num_reps)
simple_factor
## [1] 1 1 1 2 2 2 3 3 3 4 4 4
## Levels: 1 2 3 4

The main inputs of gl() are n and k, that is, the number of levels and the
number of replications of each level. Especially for working with data under
the approach of Design of Experiments (DoE), gl() can be very useful.
Here’s another example setting the arguments labels and length:
# another factor with gl()
girl_boy = gl(2, 4, labels = c("girl", "boy"), length = 7)
girl_boy
## [1] girl girl girl girl boy boy boy
## Levels: girl boy

By default, the total number of elements is 8 (n = 2 times k = 4): four
girls and four boys. But since we set the argument length = 7, we only got
three boys.
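
The argument length can also be larger than k, in which case the pattern of
levels is repeated. This is handy for laying out the combinations of two
factors, as in a small designed experiment. The names treatment, dose, and
shirt below are hypothetical and used only for illustration:

```r
# k = 1 with a longer length produces an alternating pattern
alternating <- gl(2, 1, length = 6, labels = c("girl", "boy"))
alternating

# a 2 x 3 layout: every combination of treatment and dose appears once
design <- data.frame(
  treatment = gl(2, 3, labels = c("control", "treated")),
  dose = gl(3, 1, length = 6, labels = c("low", "mid", "high"))
)
table(design)

# ordered = TRUE returns an ordered factor
shirt <- gl(3, 2, labels = c("S", "M", "L"), ordered = TRUE)
is.ordered(shirt)
```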
