15.14 Recoding a Continuous Variable to a Categorical Variable
15.14.2 Solution
Use the cut() function. In this example, we’ll use the PlantGrowth data set and recode the continuous variable weight into a categorical variable, wtclass, using the cut() function:
pg <- PlantGrowth
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))
pg
#> weight group wtclass
#> 1 4.17 ctrl (0,5]
#> 2 5.58 ctrl (5,6]
#> ...<26 more rows>...
#> 29 5.80 trt2 (5,6]
#> 30 5.26 trt2 (5,6]
15.14.3 Discussion
For three categories we specify four bounds, which can include Inf and -Inf. If a data value falls outside of the specified bounds, it’s categorized as NA. The result of cut() is a factor, and you can see from the example that the factor levels are named after the bounds.
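The behavior described above can be checked on a small vector. This is a minimal sketch: the values in x are hypothetical, chosen so that one falls below the lowest bound, and they are not taken from PlantGrowth.

```r
# Hypothetical values: one below the lowest bound, three inside the bounds
x <- c(-1, 2, 5.5, 7)
wtclass <- cut(x, breaks = c(0, 5, 6, Inf))

wtclass             # -1 falls outside the bounds, so it becomes NA
is.factor(wtclass)  # cut() returns a factor
levels(wtclass)     # the levels are named after the bounds
```

Four bounds produce three levels here, and the out-of-range value is coded as NA rather than raising an error.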
To change the names of the levels, set the labels argument:
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
                  labels = c("small", "medium", "large"))
pg
#> weight group wtclass
#> 1 4.17 ctrl small
#> 2 5.58 ctrl medium
#> ...<26 more rows>...
#> 29 5.80 trt2 medium
#> 30 5.26 trt2 medium
As indicated by the factor levels, the bounds are by default open on the left and closed on the right. In other words, each interval excludes its lower bound but includes its upper bound. For the smallest category, you can have it include both the lower and upper values by setting include.lowest = TRUE. In this example, this would result in 0 values going into the small category; otherwise, 0 would be coded as NA.
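To see the effect of include.lowest, it can help to cut values that sit exactly on the bounds. This is a minimal sketch: the vector x and the helper names bounds and sizes are hypothetical and are used only for illustration.

```r
x <- c(0, 5, 6)            # hypothetical values sitting exactly on the bounds
bounds <- c(0, 5, 6, Inf)
sizes <- c("small", "medium", "large")

# Default: intervals are (0,5], (5,6], (6,Inf], so 0 belongs to none of them
default_cut <- cut(x, breaks = bounds, labels = sizes)

# include.lowest = TRUE closes the first interval on the left as well: [0,5]
lowest_cut <- cut(x, breaks = bounds, labels = sizes, include.lowest = TRUE)

default_cut  # the first element is NA
lowest_cut   # the first element is now small
```

Only the lowest interval changes. The values 5 and 6 are assigned to small and medium in both calls.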
If you want the categories to be closed on the left and open on the right, set right = FALSE:
cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
#> [1] [0,5) [5,6) [5,6) [6,Inf) [0,5) [0,5) [5,6) [0,5) [5,6)
#> [10] [5,6) [0,5) [0,5) [0,5) [0,5) [5,6) [0,5) [6,Inf) [0,5)
#> [19] [0,5) [0,5) [6,Inf) [5,6) [5,6) [5,6) [5,6) [5,6) [0,5)
#> [28] [6,Inf) [5,6) [5,6)
#> Levels: [0,5) [5,6) [6,Inf)
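The difference between the two settings shows up only for values that fall exactly on a bound. The following sketch uses two hypothetical boundary values (not taken from PlantGrowth) to compare them:

```r
x <- c(5, 6)               # hypothetical values exactly on the inner bounds
bounds <- c(0, 5, 6, Inf)

# Default (right = TRUE): each interval includes its upper bound
closed_right <- cut(x, breaks = bounds)

# right = FALSE: each interval includes its lower bound instead
closed_left <- cut(x, breaks = bounds, right = FALSE)

as.character(closed_right)  # 5 falls in (0,5], 6 falls in (5,6]
as.character(closed_left)   # 5 falls in [5,6), 6 falls in [6,Inf)
```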
15.14.4 See Also
To recode a categorical variable to another categorical variable, see Recipe 15.13.