6.3 Making a Density Curve
6.3.2 Solution
Use geom_density() and map a continuous variable to x (Figure 6.8):
ggplot(faithful, aes(x = waiting)) +
geom_density()
If you don’t like the lines along the sides and bottom, you can use geom_line(stat = "density") (see Figure 6.8, right):
# expand_limits() increases the y range to include the value 0
ggplot(faithful, aes(x = waiting)) +
geom_line(stat = "density") +
expand_limits(y = 0)
6.3.3 Discussion
Like geom_histogram(), geom_density() requires just one column from a data frame. For this example, we’ll use the faithful data set, which contains two columns of data about the Old Faithful geyser: eruptions, which is the length of each eruption, and waiting, which is the length of time until the next eruption. We’ll only use the waiting column in this example:
faithful
#>     eruptions waiting
#> 1       3.600      79
#> 2       1.800      54
#> 3       3.333      74
#>  ...<266 more rows>...
#> 270     4.417      90
#> 271     1.817      46
#> 272     4.467      74
The second method, geom_line(stat = "density"), tells geom_line() to use the “density” statistical transformation. The result is essentially the same as with the first method, geom_density(), except that geom_density() draws the curve as a closed polygon, which is what adds the lines along the sides and bottom.
As with geom_histogram(), if you just want to get a quick look at data that isn’t in a data frame, you can get the same result by passing in NULL for the data and giving ggplot() a vector of values. This would have the same result as the first solution:
# Store the values in a simple vector
w <- faithful$waiting
ggplot(NULL, aes(x = w)) +
geom_density()
A kernel density curve is an estimate of the population distribution, based on the sample data. The amount of smoothing depends on the kernel bandwidth: the larger the bandwidth, the more smoothing there is. The bandwidth can be scaled with the adjust parameter, a multiplier on the default bandwidth that has a default value of 1. Figure 6.9 shows what happens with a smaller and a larger value of adjust:
ggplot(faithful, aes(x = waiting)) +
geom_line(stat = "density") +
geom_line(stat = "density", adjust = .25, colour = "red") +
geom_line(stat = "density", adjust = 2, colour = "blue")
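To see what adjust does numerically, here is a minimal sketch that calls base R's density() directly instead of going through ggplot2. It assumes the default bandwidth rule is the same in both cases (the "nrd0" rule), which is worth checking against your ggplot2 version; the variable names are only illustrative.

```r
# Waiting times from the built-in faithful data set
w <- faithful$waiting

# density() reports the bandwidth it actually used in its bw element
bw_default <- density(w)$bw                 # adjust = 1
bw_narrow  <- density(w, adjust = 0.25)$bw  # less smoothing
bw_wide    <- density(w, adjust = 2)$bw     # more smoothing

# adjust acts as a multiplier on the default bandwidth,
# so these ratios come out as the adjust values themselves
c(narrow = bw_narrow / bw_default, wide = bw_wide / bw_default)
```

If you need an absolute bandwidth rather than a multiplier, density() also accepts a bw argument.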
In this example, the x range is automatically set so that it contains the data, but this results in the edges of the curve getting clipped. To show more of the curve, set the x limits (Figure 6.10). We’ll also add an 80% transparent fill, with alpha = .2:
ggplot(faithful, aes(x = waiting)) +
geom_density(fill = "blue", alpha = .2) +
xlim(35, 105)
# This draws a blue polygon with geom_density(), then adds a line on top
ggplot(faithful, aes(x = waiting)) +
geom_density(fill = "blue", alpha = .2, colour = NA) +
xlim(35, 105) +
geom_line(stat = "density")
If this edge-clipping happens with your data, it might mean that your curve is too smooth. If the curve is much wider than your data, it might not be the best model of your data, or it could be because you have a small data set.
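One way to judge how much wider the curve is than the data is to compare their ranges. The sketch below uses base R's density(), whose default cut = 3 evaluates the estimate out to three bandwidths beyond the smallest and largest observations. It is meant as a rough guide for choosing x limits, not a description of what ggplot2 does internally.

```r
w <- faithful$waiting
d <- density(w)  # default cut = 3

data_range  <- range(w)    # smallest and largest observation
curve_range <- range(d$x)  # where the estimate was evaluated

# How far the estimate extends past the data on each side,
# expressed in units of the bandwidth
overhang <- (curve_range - data_range) / d$bw
overhang
```

A large bandwidth relative to the spread of the data means long tails beyond the observations, which is the situation the paragraph above warns about.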
To compare the theoretical and observed distributions of your data, you can overlay the density curve with the histogram. Since the y values for the density curve are small (the area under the curve is always 1), the curve would be barely visible if you overlaid it on a histogram of counts without any transformation. To solve this problem, you can scale down the histogram to match the density curve with the mapping y = ..density.. (recent versions of ggplot2 prefer the equivalent after_stat(density)). Here we’ll add geom_histogram() first, and then layer geom_density() on top (Figure 6.11):
ggplot(faithful, aes(x = waiting, y = ..density..)) +
geom_histogram(fill = "cornsilk", colour = "grey60", size = .2) +
geom_density() +
xlim(35, 105)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).
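The difference in scale between counts and densities can be checked without drawing anything. This is a minimal sketch using base R's hist() and density() rather than ggplot2, so the bins and the exact numbers will not match the figure; it only illustrates why the histogram has to be rescaled.

```r
w <- faithful$waiting

# Histogram without plotting: raw counts and the rescaled density heights
h <- hist(w, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))  # total bar area on the density scale

# Kernel density estimate, integrated with the trapezoid rule
d <- density(w)
curve_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Counts are far larger than density heights; both areas are close to 1
c(max_count = max(h$counts), max_density = max(d$y),
  hist_area = hist_area, curve_area = curve_area)
```

On the density scale the bars and the curve enclose about the same area, so they can share a y axis.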
6.3.4 See Also
See Recipe 6.9 for information on violin plots, which are another way of representing density curves and may be more appropriate for comparing multiple distributions.