15.2 Getting Information About a Data Structure
15.2.2 Solution
Use the str()
function:
str(ToothGrowth)
#> 'data.frame': 60 obs. of 3 variables:
#> $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#> $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
#> $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
This tells us that ToothGrowth
is a data frame with three columns, len
, supp
, and dose
. len
and dose
contain numeric values, while supp
is a factor with two levels.
Another useful function is the summary()
function:
summary(ToothGrowth)
#> len supp dose
#> Min. : 4.20 OJ:30 Min. :0.500
#> 1st Qu.:13.07 VC:30 1st Qu.:0.500
#> Median :19.25 Median :1.000
#> Mean :18.81 Mean :1.167
#> 3rd Qu.:25.27 3rd Qu.:2.000
#> Max. :33.90 Max. :2.000
Instead of showing you the first few values of each column as str()
does, summary()
provides basic descriptive statistics (the minimum, maximum, median, mean, and first & third quartile values) for numeric variables, and tells you the number of values corresponding to each character value or factor level if it is a character or factor variable.
15.2.3 Discussion
The str()
function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.
When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run str()
on the data frame, or print out the column by itself:
ToothGrowth
tg <-$supp <- as.character(tg$supp)
tgstr(tg)
#> 'data.frame': 60 obs. of 3 variables:
#> $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#> $ supp: chr "VC" "VC" "VC" "VC" ...
#> $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Print out the columns by themselves
# From old data frame (factor)
$supp
ToothGrowth#> [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
#> [25] VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> [49] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> Levels: OJ VC
# From new data frame (character)
$supp
tg#> [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [15] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [29] "VC" "VC" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [43] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [57] "OJ" "OJ" "OJ" "OJ"