Maps are a powerful and appealing way of visualizing data. A choropleth map shows geographical regions colored, shaded, or graded according to some variable. They are visually striking, especially when the spatial units of the map are familiar entities, like the countries of the European Union, or states in the United States. But maps like this can also sometimes be misleading. Although it is not a dedicated Geographical Information System (GIS), spatial analysis in R is well-developed and well-integrated with the tools we have been working with. R can work with geographical data and ggplot can make many kinds of maps. In this chapter we’ll learn how to make some of them, but also consider some other ways of representing data of this kind.
Figure 7.1: 2024 U.S. presidential election results shown as maps of different kinds. All were drawn with ggplot.
Figure 7.1 shows a series of maps of the U.S. presidential election results in 2024, where the blue and red colors correspond to the Democratic and Republican parties. All of them were made with ggplot. Reading from the top left, we see, first, a state-level map where the margin of within each state is represented on a gradient with a neutral gray midpoint. Second, we see a county-level map with only two values, red or blue, depending on the winner. Third is a county-level map where the color of red and blue counties is graded by the size of the vote share. While the state-level graph in the top-left is continuous, this one has binned the data into a relatively large number of categories. Fourth is a county-level map using the same data as the third one, and a color scheme that also runs from blue to red, but that passes through a purple midpoint (purple is a mixture of blue and red) for areas where the balance of the vote is close to even. The map in the bottom left attempts to correct for one of the main shortcomings of choropleth maps by plotting the vote for each county as a circle whose size is proportional to the number of people who live there. But this creates other difficulties, as the circles overlap, particularly in the eastern side of the map. Finally in the bottom right we see a cartogram. Here, each hexagonal tile corresponds to an electoral college vote. There are five hundred and thirty eight votes in total, with each state being allocated a number of votes proportional to its population. The hexagons are arranged to roughly evoke the spatial arrangement of the states.
Each of these maps shows data for the same event, but the impressions they convey are very different. Each faces two main problems. First, the underlying quantities of interest are only partly spatial. The number of electoral college votes won and the share of votes cast within a state or county are expressed in spatial terms, but ultimately it is the number of people within those regions that matter. Second, the regions themselves are of wildly differing sizes, and they differ in a way that is not well correlated with the magnitudes of the underlying votes. When making maps like this we also face choices that would arise in many other representations of the data. Do we want to just show who won each state in absolute terms (this is all that matters for the actual result, in the end) or do we want to indicate how close the race was? Do we want to display the results at some finer level of resolution than is relevant to the outcome, such as the county-level, but which may be of interest to us for some other reason? How can we show that different data points can carry very different weights because they represent vastly larger or smaller numbers of people? It is tricky enough to convey these choices honestly with different colors and shape sizes on a simple scatterplot. Often a map is like a weird grid that you are forced to conform to even though you know it systematically misrepresents what you want to show.
This is not always the case, of course. Sometimes our data really is purely spatial, and we can observe it at a fine enough level of detail that we can represent spatial distributions honestly and in a compelling way. But the spatial features of much social science are collected through entities such as precincts, neighborhoods, metro areas, census tracts, counties, states, and nations. These may themselves be socially contingent, or fundamentally “not real” with respect to what we are interested in showing. Even if we start with “purely” spatial data, such as an array of points representing people or things at specific positions in space, this data will look different—indeed, arbitrarily different—depending on how we choose to aggregate it. A great deal of cartographic work with social-scientific variables involves working both with and against that arbitrariness. Geographers call this the Modifiable Areal Unit Problem, or MAUP (Openshaw 1984; Unwin 1996).
In the next two sections, we will draw some maps of U.S. states and then fill them with data for the presidential election of 2024. First, we will draw a map in a basic way that we will not pursue further beyond showing that it’s possible. This proof-of-concept approach will use geom_polygon(), a geom made for drawing any kind of closed shape, in conjunction with explicitly transforming the coordinate system for the x and y axes. We’re going to do this to get across the idea that a map is just points, lines, and shapes drawn on some plane and joined-up in the right order. In that sense it is not fundamentally different from any other graph we have drawn, and the same “flow of action” that we have been working with is going to apply to it. Once we have that in mind, we can introduce a new package, Simple Features, that extends tibbles in a way that lets us store spatial information in a tidy manner. Then we will draw our maps with a new geom, geom_sf(), that understands these extensions and automatically manages things like map geometry and coordinate projections.
Draw a Map of U.S. States
Let’s examine a table of data summarizing the 2024 U.S. presidential election. The election24 dataset has various measures of the vote and vote shares by state. Here we pick some columns and sample a few rows at random.
# A tibble: 5 × 6
state fips total_vote r_points party census
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 West Virginia 54 762582 41.9 Republican South
2 Texas 48 11388674 13.7 Republican South
3 Kentucky 21 2074530 30.5 Republican South
4 North Carolina 37 5699141 3.21 Republican South
5 Delaware 10 512912 -14.7 Democratic South
The FIPS code is a federal standard that numbers administrative areas of the United States. States and territories have two-digit FIPS codes. It extends to the county level with an additional four digits, so every U.S. county has a unique six-digit identifier, where the first two digits represent the state. This dataset also contains the census region of each state.
Figure 7.2: 2024 Election Results. Would a two-color choropleth map be more informative than this, or less?
The first thing you should remember about apparently spatial data is that you don’t have to represent it spatially. We’ve been working with country-level data throughout this book and have yet to make a map of it. So we can start with a state-level dotplot, faceted by region (Figure 7.2).
# Hex color codes for parties (we set a name attribute, too).party_colors <-c("Democratic"="#2E74C0","Republican"="#CB454A")election24 |>filter(st %nin%"DC") |>ggplot(mapping =aes(x = r_points,y =reorder(state, r_points), color = party)) +geom_vline(xintercept =0, color ="gray60") +geom_point(size =2) +scale_color_manual(values = party_colors) +scale_x_continuous(breaks =c(-30, -20, -10, 0, 10, 20, 30, 40),labels =c("30\n (Harris)", "20", "10", "0","10", "20", "30", "40\n(Trump)")) +facet_wrap(~reorder(census, r_points),ncol =1,scales ="free_y",space ="free_y" ) +guides(color ="none") +labs(x ="Point Margin", y ="") +theme(axis.text =element_text(size =10))
This figure brings together many aspects of plot construction that we have worked on so far, including subsetting data, reordering both the y-axis and the panels by a second variable, and using a scale formatter. It also introduces some new options, like manually setting the color of an aesthetic. We also have a new argument to facet_wrap(). Previously we saw scales = "free_y" as a way to manage faceting by categorical variables with different numbers of rows. Here, space = "free_y" allows the vertical heights of the facets to vary with the number of rows, so we don’t end up with some panels looking too sparsely-populated.
We write the code for Figure 7.2 “all in one breath”, beginning from the dataset but, as always, you should try stepping through the code a line at a time to see how each function changes the plot. Experiment with the function arguments, too. What happens if you remove the scales="free_y" argument to facet_wrap()? What happens if you delete the call to scale_color_manual()?
To draw a map of this, we need some data on the location and shape of U.S. states. There’s a function in ggplot that lets us extract some from R’s maps package. It comes out as a data frame but we convert it to a tibble:
This table has more than 15,000 rows. Each one is describes a point with a longitude and latitude coordinate. If we want, we can draw these with geom_point(). The (horrible) result is shown in Figure 7.3.
These data are for the continental U.S. only; Alaska and Hawaii are not included. We will remedy this shortly.
us_states |>ggplot(mapping =aes(x = long, y = lat)) +geom_point(size =0.1)
Figure 7.3: A map of the continental U.S. made out of points.
With geom_polygon() we can use this data to draw filled shapes. Like the example we saw in Section 4.2, this geom needs to know the order in which to draw lines and when to “lift the pen”. The us_states object has an order column, but the tibble is already sorted by that. The groupings are given by the group column. We write this to produce Figure 7.4:
us_states |>ggplot(mapping =aes(x = long, y = lat,group = group)) +geom_polygon(fill ="white", color ="black")
Figure 7.4: A map of U.S. state borders.
A map is, in the end, just a set of lines drawn in the right order on a plane. We can map the fill aesthetic to region, set color to a light gray, and thin the lines (via linewidth) to make the state borders a little nicer. We’ll also tell R not to plot a legend, because otherwise we would get a key with fifty entries in it. Finally, by default the map plots latitude and longitude positions right onto the x and y axes. It doesn’t look that good. We need to account for the fact that the earth is round (currently fashionable social media claims notwithstanding). If you look again at the maps in Figure 7.1, you’ll see they are nicer. Look, for example, at the way that the U.S.-Canadian border is a little curved along the 49th parallel from Washington state to Minnesota, rather than a straight line.
Maps take locations defined on a more or less spherical object and draw them on a flat surface, which means we must have a method for transforming or projecting our points and lines from one to the other. The many techniques for doing this are a fascinating world of their own, but for now it’s enough to know we can transform our plot via the coord_map() function. While coordinate systems are a necessary part of the plotting process for any data, we have not had to specify a coord_ function for any graphs so far. Behind the scenes, ggplot used a coord_cartesian() function to draw our x-y plane. When mapping areas on the globe, we must explicitly choose a projection. For the continental United States, the conventional choice is a variant of the Albers projection.
us_states |>ggplot(mapping =aes(x = long, y = lat, group = group,fill = region)) +geom_polygon(color ="gray90", linewidth =0.1) +coord_map(projection ="albers", lat0 =39, lat1 =45) +guides(fill ="none")
Figure 7.5: Adding a fill and improving the projection.
Albers requires two latitude parameters, which the function names lat0 and lat1. We give them their conventional values for a U.S. map here (Figure 7.5). The result is much better than what we had before. Incidentally, the curvature that has been introduced in the axis grid lines (compare them to the right-angled grid in Figure 7.4) gives us a nice intuitive sense of what a map projection involves. Now we just need to get our own data onto the map, instead of just using the region column as the fill.
On maps, these grid lines are called graticules.
Use Simple Features to Draw Maps
At this point we will change gears. We now have a good sense that a ggplot map should be a geom like any other, capable of taking aesthetics like fill and color, being layered with other geoms, and adjusted with scale and guide functions. But it should also be clear from looking at the us_states tibble that a detailed map is going to involve quite a lot of data. In the case of our state-level map, we already have a tibble that has more than 15,500 rows in it, each one encoding a single point. Drawing shapes with this data involves joining up all those lines in the right order. Moreover, consider our tibble of state-level election data again:
# A tibble: 51 × 6
state st fips total_vote winner pct_margin
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Alabama AL 01 2265090 Trump 0.305
2 Alaska AK 02 338177 Trump 0.131
3 Arizona AZ 04 3390161 Trump 0.0553
4 Arkansas AR 05 1182676 Trump 0.306
5 California CA 06 15865475 Harris -0.201
6 Colorado CO 08 3192745 Harris -0.110
7 Connecticut CT 09 1759010 Harris -0.145
8 District of Columbia DC 11 325869 Harris -0.838
9 Delaware DE 10 512912 Harris -0.147
10 Florida FL 12 10893752 Trump 0.131
# ℹ 41 more rows
It only has fifty one rows (one for each state, plus the District of Columbia). To use it with the code we wrote above we would have to merge it with or join it to the big us_states table of points. This would result in a tibble of some 15,000 rows with each of the election24 rows repeated hundreds of times for each point associated with a single state. This is not an efficient way to work. What we are looking for, in effect, is something roughly equivalent to the list columns that we created in Chapter 6 when we put a linear model in each row of our nested continent-year table. There, the <lm> entry in each row was a full object in its own right, but with all the pieces compactly stored (from our point of view) in a single cell of a table row. We should do the same for our maps. That is, we should have a way to store the points, lines, or shapes corresponding to some unit of observation, like a state, in a single cell in a tibble. That way we would have a 51-row tibble of election data where one column contained the map geometry for each state. This is what the simple features package lets us do (Pebesma 2016; Pebesma and Bivand 2023). We load it like this:
library(sf)
Linking to GEOS 3.13.0, GDAL 3.5.3, PROJ 9.5.1; sf_use_s2()
is TRUE
When we do, we get a message about various behind-the-scenes libraries (GEOS, GDAL, and PROJ) that the package itself is relying on. This message gives a small hint about the tremendous amount of collaborative work involved in setting open standards for spatial data. This work goes far beyond ggplot and the R language. The sf package allows us to smoothly take advantage of it. The socviz package contains a simple features object with geometries for U.S. states. We will use it instead of us_states. It’s called states_sf:
The Simple Features standard is ISO 19125-1:2004.
states_sf
Simple feature collection with 51 features and 4 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -2500000 ymin: -1678180 xmax: 2258150 ymax: 1558940
Projected CRS: NAD83 / Conus Albers
# A tibble: 51 × 5
fips st state census geometry
* <chr> <chr> <chr> <chr> <GEOMETRY [m]>
1 48 TX Texas South POLYGON ((-997956 -562335…
2 06 CA California West MULTIPOLYGON (((-2055956 …
3 21 KY Kentucky South POLYGON ((577745 -84797.2…
4 13 GA Georgia South POLYGON ((951008 -229093,…
5 55 WI Wisconsin Midwest MULTIPOLYGON (((715684 92…
6 41 OR Oregon West POLYGON ((-2276699 954231…
7 29 MO Missouri Midwest POLYGON ((39182.8 345516,…
8 51 VA Virginia South MULTIPOLYGON (((1742702 1…
9 47 TN Tennessee South POLYGON ((518502 -259828,…
10 22 LA Louisiana South POLYGON ((182195 -524215,…
# ℹ 41 more rows
There are two differences between this object and a regular tibble. The first is the special geometry column. As the name of the column suggests, it has the geometric information needed to draw the shape of each state. In this case, those are x and y coordinates. Some of the rows are prefaced with POLYGON and others with MULTIPOLYGON. This is because some of the states are drawn as a single shape while others have more than one piece. In the type indicator, <GEOMETRY [m]> the [m] signifies that column’s values are denominated in meters. The simple features standard defines different features, corresponding to physical objects in the world. Its details and extensions need not concern us here, but the core features are points, linestrings, and polygons. Points are individual locations in space. Linestrings are sequences of points connected by straight, non-intersecting line segments. Polygons are lines that form closed shapes that do not intersect with themselves. Each of these has a “multi-” variant, i.e. multipoint, multilinestring, and multipolygon. These are used for features built from collections of the corresponding single type. The most important property of a geometry is its location, which in this case is what the numbers we see in the geometry column define: they draw its shape at a defined position in the world.
The standard also defines more complex features that we will not address here. There is also a geometrycollection feature, which is a container for heterogeneous combinations of types
The second difference from ordinary tibbles is seen in the metadata printed when we type the object’s name at the console. The first line tells you how many rows and columns are in the object, with “features” meaning rows and “fields” meaning columns. Then there are some lines about the geometry type (e.g. in a simpler case it might consist of just points). The Bounding Box defines a quadrilateral on the surface of the earth giving the bounds of the map. Finally, the Projected CRS tells us what coordinate reference system and map projection is being used. There are thousands of these. Our introductory use of simple features objects will just scratch the surface of what’s possible with them.
You can inspect CRS options at https://epsg.io. In the CRS information line for states_sf, the “Conus” in “Conus Albers” means “Continental United States”. This particular map has also had Alaska and Hawaii moved to the lower left area so we can include those states without having to plot their actual locations and relative sizes. This was done with the useful shift_geometry() function from the tigris package.
To make our map we will merge, or join, the election24 tibble to the states_sf tibble. To do this we make sure there is an identifying column (a key) shared between the two tables. These columns do not have to have the same name, but it is easier if they do. In this case our two tibbles have fips, state, st and census in common. We could use any combination that uniquely identifies a match between rows. In general, using a well-defined code like fips is the safest option. This is because state might be, for example, capitalized in one tibble but all-caps in the other, or have “District of Columbia” in one but “DC” in the other. In this case, we will name all the columns shared between the tibbles because, if we did not, we would get duplicates of them named e.g. state.x and state.y. Our tables have the same number of rows, so a join should be a one-to-one match. We will use dplyr’s left_join() function, and think of states_sf as being on the left and elections24 as being joined to it. In a left join, all the observations in the left-side table are always kept.
Joins are conservative in this respect. They do not assume two columns not named as keys are the same just because they share a name.
Simple feature collection with 51 features and 20 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -2500000 ymin: -1678180 xmax: 2258150 ymax: 1558940
Projected CRS: NAD83 / Conus Albers
# A tibble: 51 × 21
fips st state census geometry
<chr> <chr> <chr> <chr> <GEOMETRY [m]>
1 48 TX Texas South POLYGON ((-997956 -562335…
2 06 CA California West MULTIPOLYGON (((-2055956 …
3 21 KY Kentucky South POLYGON ((577745 -84797.2…
4 13 GA Georgia South POLYGON ((951008 -229093,…
5 55 WI Wisconsin Midwest MULTIPOLYGON (((715684 92…
6 41 OR Oregon West POLYGON ((-2276699 954231…
7 29 MO Missouri Midwest POLYGON ((39182.8 345516,…
8 51 VA Virginia South MULTIPOLYGON (((1742702 1…
9 47 TN Tennessee South POLYGON ((518502 -259828,…
10 22 LA Louisiana South POLYGON ((182195 -524215,…
# ℹ 41 more rows
# ℹ 16 more variables: total_vote <dbl>, vote_margin <dbl>,
# winner <chr>, party <chr>, pct_margin <dbl>,
# r_points <dbl>, d_points <dbl>, pct_harris <dbl>,
# pct_trump <dbl>, pct_other <dbl>, harris_vote <dbl>,
# trump_vote <dbl>, other_vote <dbl>, ev_dem <dbl>,
# ev_rep <dbl>, ev_other <dbl>
Now we can draw a map. We do not need to specify x or y coordinates in our mapping = aes() call. All of that information is in elections24_sf because it’s a simple features object, which geom_sf() understands and takes care of plotting properly. For the same reason, we don’t need to provide any explicit information about what coordinate system to use. We need only say what we want the fill to be.
To complete the map (Figure 7.7), we will use our party colors for the fill, move the legend to the bottom, and add a title. We will also remove the latitude and longitude graticules and axis labels, which aren’t needed, by defining a special theme for maps that drops the visual elements we don’t need. (We’ll learn more about themes in Chapter 8.)
Figure 7.7: The 2024 presidential election by state, with political party colors.
With the map data frame in place, we can map other variables if we like. Let’s try a continuous measure, such as the percentage of the vote received by Donald Trump. To begin with (Figure 7.8, upper panel), we will assign the variable we want (pct_trump) to the fill aesthetic, and see what ggplot does by default.
Figure 7.8: Two versions of Percent Trump by State.
Figure 7.9: Two versions of Percent Trump vs Harris share: a white midpoint, and a first ‘Purple America’ version.
The default color used for gradients, shown in the panel, is blue. Because of party color conventions in the US, that isn’t what is wanted here. In addition, the gradient runs in the “wrong” direction. In our case, the convention is to say higher vote share makes for a darker color (a “redder” state). We fix both of these problems in the lower panel of Figure 7.8 by specifying the scale directly. We’ll use the values we created earlier in party_colors.
Let’s say we want to show the comparative share for both candidates at once. In that case, we might prefer a gradient that diverges from a midpoint where both candidates have the same share. The scale_gradient2() function gives us a blue-red spectrum that passes through white by default. Alternatively, we can respecify the midlevel color along with the high and low colors. We will make purple our midpoint and use the muted() function from the scales library to tone down the saturation a little.
Figure 7.10: A Purple America version of Trump vs Harris that excludes results from the District of Columbia.
If you look at the gradient scale for this first “purple America” map, in Figure 7.9, you’ll see that it extends very high on the blue side–much higher, in fact, than any state we can see on the map. This is because the District of Columbia is included in the data, and hence the scale. Even though it is barely visible on the map, DC has by far the highest points margin in favor of the Democratic Party ticket of any observation in the data. If we omit it, we’ll see that our scale shifts in a way that does not just affect the top of the blue end but recenters the whole gradient and makes the red side more vivid as a result. Figure 7.10 shows the result.
This brings out one the familiar problems of having geographical areas that vary widely in size. In this case, we’re showing the margins for each area on a map, but one of the areas is too small to see at the size we are working with. Its values will, however, be picked up by the scale in a way that materially affects the map. Even worse, this sort of problem is often surprisingly difficult to catch. It is yet another reason to check our data processing pipeline and consider carefully how we want to represent its results.
America’s Ur-choropleths
In the U.S. case, administrative areas vary widely in geographical area and also in population size. The modifiable areal unit problem evident at the state level, as we have seen, also arises even more acutely at the county level. County-level U.S. maps can be aesthetically pleasing because of the added detail they bring to a national map. But they also make it easy to present a geographical distribution to insinuate an explanation. The results can be tricky to work with. When producing county maps, it is important to remember that the states of New Hampshire, Rhode Island, Massachusetts, and Connecticut are all smaller in area than any of the ten largest western counties. Many of those counties have fewer than a hundred thousand people living in them. Some have fewer than ten thousand inhabitants.
The result is that most choropleth maps of the United States, for whatever variable, in effect show population density more than anything else. The other big dimension that shows up in the U.S. case is percent Black. Let’s see how to draw these two maps in R. The socviz package provides a simple features object called counties_sf that contains US county-level geometries and few demographic measures for each county. We can select a few columns and then ten rows at random:
Simple feature collection with 10 features and 10 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -870000 ymin: -720000 xmax: 1600000 ymax: 1e+06
Projected CRS: NAD83 / Conus Albers
fips name area_sqmi white black asian nh_white hispanic pop
1 48099 Coryell County, Texas 1057.9 51794 11608 1664 46233 16992 83772
2 35005 Chaves County, New Mexico 6078.1 41179 964 849 23882 37329 64446
3 27155 Traverse County, Minnesota 584.5 2852 16 5 2783 184 3279
4 17201 Winnebago County, Illinois 518.3 195682 37802 8079 183505 42452 283289
5 28063 Jefferson County, Mississippi 527.8 942 5789 57 942 310 7127
6 18151 Steuben County, Indiana 322.1 32297 266 64 31848 1488 34648
7 51595 Emporia city, Virginia 6.9 1161 3607 33 1102 473 5633
8 21149 McLean County, Kentucky 256.1 8741 69 17 8640 186 9127
9 27035 Crow Wing County, Minnesota 1153.7 63196 393 292 62788 1024 67113
10 48503 Young County, Texas 931.2 14942 311 49 13584 3621 17963
pop_dens_disc geometry
1 50-100 POLYGON ((-166788 -649938, ...
2 10-50 POLYGON ((-862427 -446416, ...
3 0-10 POLYGON ((-43769 943015, -4...
4 500-1,000 POLYGON ((541425 578975, 53...
5 10-50 POLYGON ((447415 -618316, 4...
6 100-500 POLYGON ((920348 531448, 90...
7 500-1,000 POLYGON ((1621858 68281, 16...
8 10-50 POLYGON ((742291 49772, 742...
9 50-100 POLYGON ((128098 982003, 12...
10 10-50 POLYGON ((-223654 -456414, ...
The initial line in the output above says that we have a simple features collection with 10 features (i.e. rows) and 8 fields, but this is just because the print method is applied after we have selected some columns and sampled ten rows. In fact, the counties_sf object has 3,144 rows. The metadata shows that the geometry column consists of polygons of dimension xy, and that the CRS is once again an Albers projection appropriate to the United States. The NAD83 part refers to the North American Datum of 1983, which acts as a conventional reference point for the rest of the coordinate system. Different reference points are used for different sets of maps and are chosen based on a combination of convenience and measurement accuracy. There is, after all, no absolute frame of reference, so we have to pick something to get started. We can try mapping population density per square mile like this (Figure 7.11):
NAD 83 replaced NAD 27, whose geodetic datum was a survey marker in the middle of Osborne county, Kansas, called the Meades Ranch Triangulation Station. NAD 83 moved the reference to a point in the Earth’s core. NAD 83 will itself soon be superseded by a new reference datum.
counties_sf |>ggplot(mapping =aes(fill = pop_dens_disc)) +geom_sf(linewidth =0.01, color ="gray80") +theme_map()
Figure 7.11: US population density by county, with an (inappropriate) unordered color key.
While this produces a legible map, the default key is designed for an unordered categorical variable. Our density categories run from low to high. We could recode pop_dens_disc with the factor() function so that R is aware of the ordering. But seeing as the variable is already sorted in the right order, we can just supply a different palette. There are a few different ways to do this. The easiest is to use one of the palettes available via scale_fill_discrete(). We will also tweak how the legend is drawn using the guides() function to make the key appears on a single row.
We will learn more about scale and guide functions in the next chapter.
counties_sf |>ggplot(mapping =aes(fill = pop_dens_disc)) +geom_sf(linewidth =0.05, color ="black") +scale_fill_discrete(palette ="Greys") +guides(fill =guide_legend(nrow =1)) +labs(fill ="People per square mile") +theme_map()
Figure 7.12: US population density by county.
Although we have a wide range of color palettes available to us, for this map I’ve chosen to draw the fill gradient in shades of gray. As we’ll see in the next chapter, color is often quite tricky to work with properly. While we have many well-calibrated and carefully-designed color palettes available to us in various hues, don’t underestimate how effective the humble grayscale can be. People are good at interpreting it (compared to the complications color can introduce), it holds up well in many forms of media and across presentation environments, and you can do more than you might think with it as a design tool.
The palette is named “Greys”. If you spell it “Grays” you will get a reverse-coded grayscale palette, where low values are shaded darker.
Next, we can draw a corresponding map of percent Black population by county (Figure 7.13). Once again, we specify a palette for the fill mapping using scale_fill_brewer(), this time choosing a different range of hues for the map.
Figure 7.13: US Percent Black population by county.
Figures 7.12 and 7.13 are America’s “ur-choropleths”. Between the two of them, population density and percent Black will do a lot to obliterate many a suggestively-patterned map of the United States. These two measures aren’t explanations of anything in isolation, but if it turns out that it is more useful to know one or both of them instead of the thing you’re plotting, you probably want to reconsider your theory.
As a further example of the problem in action, let’s draw two new county-level choropleths (). The first is an effort to replicate a poorly sourced but widely circulated county map of firearm-related suicide rates in the United States. The su_gun6 variable in counties_sf is a measure of the average rate of all firearm-related suicides between 1999 and 2015. The rates are binned into six categories. We have a pop_dens6 variable that divides the county population density in 2014 into six categories, too.
We first draw a map with the su_gun6 variable. We will match the color palettes between the maps, but for the population map we will flip our color scale around so that less populated areas are shown in a darker shade. We do this by using a function from the RColorBrewer package to manually create two palettes. The rev() function used here reverses the order of a vector.
orange_pal <- RColorBrewer::brewer.pal(n =6, name ="Oranges")orange_pal
The brewer.pal() function produces evenly-spaced color schemes to order from any one of several named palettes. The colors are specified in hexadecimal format. Again, we will learn more about color specifications and how to manipulate palettes for mapped variables in Chapter 8. To draw the gun plot (the left panel of Figure 7.14), we write this code:
It’s clear that the two maps are not identical. However, the visual impact of the first has a lot in common with the second. The dark bands in the West (except for California) stand out, and fade as we move toward the center of the country. There are some strong similarities elsewhere on the map too, such as in the Northeast.
The keys on these plots show the default labels you get when you use the cut() function, which slices a continuous variable in to a chosen number of categories. This uses brackets and parentheses to designate closed and open intervals. The convention is that a square bracket means the endpoint is included, and a parenthesis means the endpoint is excluded. So [4,7) means “From four up to but not including seven.”
The gun-related suicide measure is already expressed as a rate. It is the number of qualifying deaths in a county, divided by that county’s population. Normally, we standardize in this way in an effort to “control for” the fact that larger populations will tend to produce more gun-related suicides just because they have more people in them. However, this sort of standardization has its limits. In particular, when the event of interest is not very common, and there is very wide variation in the base size of the units, then the denominator (e.g., the population size) starts to be expressed more and more in the standardized measure.
Figure 7.14: Gun-related suicides by county; Reverse-coded population density by county. Before sharing this picture on social media, please read the text for discussion of what is wrong with it.
Third, and more subtly, the data is subject to reporting constraints connected to population size. If there are fewer than ten events per year for a cause of death, the Centers for Disease Control (CDC) will not report them at the county level because it might be possible to identify particular deceased individuals. Assigning data like this to bins creates a threshold problem for choropleth maps. Look again at Figure 7.14. The gun-related suicides panel seems to show a north-south band of counties with the lowest rate of suicides running from the Dakotas down through Nebraska, Kansas, and into West Texas. Oddly, this band borders counties in the West with the very highest rates, from New Mexico on up. But from the density map we can see that many counties in both these regions have very low population densities. Are they really that different in their gun-related suicide rates?
Probably not. More likely, we are seeing an artifact arising from how the data is coded. For example, imagine a county with 100,000 inhabitants that experiences nine gun-related suicides in a year. The CDC will not report this number. It will instead be coded as “suppressed,” accompanied by a note saying any standardized estimates or rates will also be unreliable. But if we are determined to make a map where all the counties are colored in, we might be tempted to put any suppressed results into the lowest bin. We know that the number is somewhere between zero and ten. Why not just code it as zero? Meanwhile, a county with 100,000 inhabitants that experiences twelve gun-related suicides a year will be numerically reported. While the CDC provides the absolute number of deaths for all counties above the threshold, the notes to the data file will warn you that any rate calculated with this number will be unreliable. If we do it anyway, then twelve deaths in a small population might well put a sparsely populated county in the highest category of suicide rate. Meanwhile, a low-population county just under that threshold would be coded as being in the lowest (lightest) bin. But in reality they might not be so different, and in any case efforts to quantify that difference will be unreliable. If estimates for these counties cannot be obtained directly or estimated with a good model, then it is better to drop those cases as missing, even at the cost of your beautiful map, than have large areas of the country painted with a color derived from an unreliable number.
Do not do this. One standard alternative is to estimate the suppressed observations using a count model. An approach like this might naturally lead to more extensive, properly spatial modeling of the data.
Finally, these maps also show an aspect of data analysis that introductions like this might be tempted to airily gloss over. Why is all the data for Connecticut missing? In the first edition of this book, we had to construct our maps in a slightly laborious fashion because the sf package was not yet fully up and running. Now we use an sf object whose geometries are from U.S. Census Bureau resources current at the time of writing (early 2026). But the data shown in Figure 7.14 is still from 2014. In between, Connecticut reorganized its administrative boundaries and replaced its older counties with new planning areas. These have new FIPS codes. This means that earlier county-level data for Connecticut can’t be directly joined to current U.S. county geometries. I could have silently corrected this by updating the gun-suicide data, or included a separate sf object with 2014-era county geometry in the socviz package. But I have chosen to leave it as-is, because this sort of thing happens all the time. Units of observation (like states and counties) can change their meaning and extent over time, or be observed at inconsistent or incompatible scales and resolutions. Identifying and harmonizing the right data for the units of interest can be a frustrating experience.
Sometimes it’s possible to construct a way of getting from the old unit of observation to the new one, either directly (if there is some lower-level unit that’s shared between the old and new systems) or indirectly (e.g. by weighting data from the old units to distribute it into the new ones). This is called a “crosswalk”.There’s a reason Geographical Information Systems is a specialized area of work.
Small differences in reporting, coarse binning and miscoding, and unexpected changes in the units of analysis can produce spatially misleading, substantively mistaken, or incomplete results. It might seem that focusing on the details of variable coding in this particular case is a little too much in the weeds for a general introduction. But it is exactly these details that can dramatically alter the appearance of any graph, and especially maps, in a way that can be hard to detect after the fact. In a book focused on demonstrating how visualizations are actually produced, rather than just giving general recommendations and showing fully-polished finished products, it’s more honest to acknowledge these difficulties directly rather than glossing over them.
Statebins
As an alternative to state-level choropleths, we can consider making cartograms. A cartogram splits the difference between the spatial detail of a map and the ordinary representation of data on a scatterplot or bar chart by converting the geographical units to some regular shape and making a little effort to preserve, or at least evoke, the spatial relations between the units. For the case of U.S. states, we can use statebins, a package developed by Bob Rudis (Rudis 2014).
Figure 7.15: Cartograms of the state winner (upper) and Trump vote share (lower). The state boxes in the upper panel have rounded corners (the default); in the lower panel they are squares.
Statebins can work either of two ways. It has a standalone statebins() function that is a little like an all-in-one ggplot call, but which has a slightly different syntax from the one we’re used to. Alternatively, the package also provides geom_statebins(), which works more like a regular geom. We will use the latter here to recreate some of the state-level maps we made earlier in this chapter. Here is the code for maps of the state winners and Trump’s vote share (Figure 7.15):
The state argument is required. Our state abbreviations are in the st column of the election24 dataset. The geom will automatically choose whether to color the state abbreviations in a light or dark color based on the color of the state’s square. You can specify what color you’d like the light labels and the dark labels to be with the arguments light_lbl and dark_lbl, respectively. In the first plot we set the size of the labels using the lbl_size argument to geom_statebins(). In the second, we also adjust the corner radius of the state boxes. By default they are slightly rounded. Writing radius = unit(0, "pt") makes them squares. The coord_equal() function forces the (implicit) x and y axes of the graphs to have the same relative sizes even if the physical size of the plot does not have equal width and height. Finally, theme_void() is a ggplot theme that switches off all the guides, gridlines, labels, and so on in a plot.
The unit() function is used generally across geoms and other ggplot functions.
Small-Multiple Maps
Sometimes we have geographical data with repeated observations over time. A common case is to have a country- or state-level measure observed over a period of years. In these cases, we might want to make a small-multiple map to show changes over time. For example, the opiates dataset in socviz has state-level measures of the death rate from opiate-related causes (such as heroin or fentanyl overdoses) between 1999 and 2020.
opiates
# A tibble: 1,122 × 8
fips st year deaths crude adjusted region
<chr> <chr> <int> <int> <dbl> <dbl> <ord>
1 01 AL 1999 37 0.8 0.8 South
2 01 AL 2000 43 1 1 South
3 01 AL 2001 57 1.3 1.3 South
4 01 AL 2002 71 1.6 1.6 South
5 01 AL 2003 49 1.1 1.1 South
6 01 AL 2004 83 1.8 1.8 South
7 01 AL 2005 80 1.8 1.8 South
8 01 AL 2006 124 2.7 2.7 South
9 01 AL 2007 165 3.5 3.6 South
10 01 AL 2008 185 3.9 4.1 South
# ℹ 1,112 more rows
# ℹ 1 more variable: division_name <chr>
As before, we can join this data to the states_sf, the simple features object that has our U.S. state geometries in it. The tables share the fips and st columns. Again, we could just join by fips, because it uniquely identifies how to match every row, but we can use more keys to prevent them being unnecessarily duplicated. In addition, the region column in opiates encodes the same variable as the census column in states_sf; they just have different names. When we see a data frame we ask “What is a row?” In the opiates table, each row is a state-year, with information on death rates for some specific state at a particular time. in states_sf, each row is a state. For the join, we can think of opiates as being the big table on the left side of the join and states_sf as being the small table on the right side. The join merges each row of states_sf to the corresponding rows of opiates, repeating them as needed down the years.
Simple feature collection with 1122 features and 9 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -2500000 ymin: -1680000 xmax: 2260000 ymax: 1560000
Projected CRS: NAD83 / Conus Albers
# A tibble: 1,122 × 10
fips st state census geometry year
<chr> <chr> <chr> <chr> <POLYGON [m]> <int>
1 01 AL Alaba… South ((708114 -594376, 709023… 1999
2 01 AL Alaba… South ((708114 -594376, 709023… 2000
3 01 AL Alaba… South ((708114 -594376, 709023… 2001
4 01 AL Alaba… South ((708114 -594376, 709023… 2002
5 01 AL Alaba… South ((708114 -594376, 709023… 2003
6 01 AL Alaba… South ((708114 -594376, 709023… 2004
7 01 AL Alaba… South ((708114 -594376, 709023… 2005
8 01 AL Alaba… South ((708114 -594376, 709023… 2006
9 01 AL Alaba… South ((708114 -594376, 709023… 2007
10 01 AL Alaba… South ((708114 -594376, 709023… 2008
# ℹ 1,112 more rows
# ℹ 4 more variables: deaths <int>, crude <dbl>,
# adjusted <dbl>, division_name <chr>
The join_by() function defines the columns to merge on. Writing census == region asserts that the census column in states_sf has the same key information as the region column in opiates. You can see from the result that all the states_sf columns have been merged with the opiates data. It all repeats down the rows, including the geometries. While column data is stored fairly efficiently by R, this repetition and redundancy is why we tend to keep tables of data separate and then join them when needed rather than storing everything in a single giant able. The call to arrange() isn’t strictly necessary; it’s just there to keep the data ordered as we would expect.
Row order should never be relied on to merge tables. This is why we join using key columns.
We are now in a position to make a faceted small-multiple with one map for each year in the data. To do it, we will bin the death rate before handing the tibble to ggplot. To make the number of panels in the plot more manageable we will filter out some of the earlier years in the dataset. For our color palette we will use a new scale function that implements the viridis color schemes. Viridis is a collection of palettes that do a very good job of combining perceptually uniform colors with easy-to-see, easily-contrasted hues along their scales (Garnier 2015). Some balanced palettes can be a little washed out at their lower end, especially, but the viridis palettes avoid this. In this code, the _d suffix in the scale_fill_viridis_d() function signals that it is the scale for discrete data. There is a scale_fill_viridis_c() equivalent for continuous data.
We facet the maps with facet_wrap(), just like any other small-multiple. One of the most useful features of ggplot is the general nature of its framework. A map is just a geom, and we can arrange it like we would any other geom. Here’s the code for Figure 7.16:
Before discussing whether this is a good way to visualize this data, we can highlight a few details in what we have written here. We’ll explore more of these tricks to polish plots shortly. In the mutate call, we use cut_width() to cut the adjusted death rate up into discrete bins. The width and boundary arguments ensure that the breaks between the bins will be where we want them. This is worth thinking about in its own right. Discretizing continuous data like this is something we do a lot, and it has benefits for visualization. It’s easier to distinguish differences in a binned scale than a continuous one. But, by the same token, that means the choices made around binning can matter a great deal. It’s potentially another way to introduce some arbitrariness into the process of data analysis. When we drew histograms back in Chapter 3 we noted that choosing the number of bins or the binwidth would make a substantial difference to how the histogram looked. Exactly the same is true here. A useful exercise is to revisit the code for the ur-choropleths, but use continuous rather than binned measures, as well as the viridis palette. You could also try starting from a continuous measure of density cutting it by various methods and to differing numbers of levels.
When modeling data, discretizing a variable by cutting it into quartiles, deciles, and so on makes it easier to extract apparently meaningful differences across its distribution. The leverage offered by statistical models amplifies this issue. The limit case is needlessly dichotomizing variables. See, e.g., Fedorov et al. (2009); Van Zwet et al. (2026).In addition to base R’s flexible cut() function, dplyr provides cut_width(), which we used here to make groups of equal width, cut_interval(), which makes n groups with equal range, and cut_number(), which makes n groups with approximately equal numbers of observations.
Figure 7.16: A small multiple map.
The st_as_sf() call reasserts that opiates_sf() is a simple features object rather than just a tibble with a column named geometry. In the scale_fill_viridis_d() call, the option argument picks the specific palette we use. The labels argument is there to deal with how missing data is labeled in the legend. Without that argument, the white box at the end of the key will have the label NA, the raw missing value tag. As we saw in Chapter 5, we could also manually specify the labels here. But we would have to name all of them, and perhaps as we revised our graph the labels and breaks might change. Instead, we write an anonymous function that applies a test to the vector of labels that ggplot generates automatically. That vector is the becomes a variable (\(x)) handed to the ifelse() test, which has three parts: a test that yields TRUE or FALSE as an answer, an instruction about what to do if the answer is TRUE and an instruction about what to do if it’s FALSE. Here the function says “Test each element of the label to see if it contains the missing value NA; if it does, replace that element with the phrase ‘No data’ and return that; if it doesn’t, don’t do anything, just return what you were originally handed.”
Here is our “waving person” again, \(x), indicating an anonymous or lambda function.Functions that return TRUE or FALSE, like is.na() or is.integer() are called predicate functions.
The direction argument in scales_viridis_d() can be either 1 (the default) or -1 (as here). By using -1 we reverse the start and end colors of the palette. This is worth experimenting with, not just in the context of mapping but also for any area-based plot. For some kinds of graph, it may seem more natural to have the darker color mean “more” and the lighter color “less”. The analogy would be to an increasing density, or filling in an empty space. But in other cases, it might seem like darker should mean “less” and brighter colors “more”. The analogy in those cases would be more to the intensity of light, like patches of fire on a dark hillside. Balanced color schemes like the viridis palettes can make this choice harder because they often transition between two colors as they vary their luminance. And beyond that, in a manner akin to “duck/rabbit” figure-and-ground effects, people may “naturally” see the opposite from the one you intended.
Is Your Data Really Spatial?
Such questions lead us to ask whether drawing this small-multiple map is a good way to visualize the data in the first place. As we discussed above, county-level choropleth maps of the United States tend to track first the size of the local population and secondarily the percent of the population who are Black. For state level, it is the differences in the geographical size of states that dominates, and faceting only makes this problem worse. The difficulties are like a supercharged version of the small-multiple bar chart we made in Figure 6.9, where we also faceted by year. There it was already quite tricky to compare changes in the same bar between different panels, even though we only had a few categories to compare. Now we have fifty states of widely-varying size. The repeated measures and the visual familiarity of the U.S. map do mean that some comparison is possible. The strong trends for this data make things a little easier to see, too. The increasing rate of deaths involving opioids are easy to see concentrating in the Southwest and, with increasing severity, in the Appalachians centered on West Virginia.
As we noted at the beginning of the chapter, even if our data is collected via or grouped into spatial units, it is always worth asking whether a map is the best way to present it. Much county, state, and national data is not properly spatial, insofar as it is really about individuals (or some other unit of interest) rather than the geographical distribution of those units per se. Let’s try the same thing we did with our small-multiple bar chart and redraw it as a time-series plot. We will keep the state-level focus but try to make the trends more directly visible.
We could just plot the trends for every state, as we did at the very beginning with the gapminder data. But fifty states is too many lines to keep track of at once.
A more informative approach is to take advantage of the geographical structure of the data by using census divisions to group the states. Imagine a faceted plot showing state-level trends within each region of the country, perhaps with a trend line for each region, and endpoint labels identifying the states. To do this, we will take advantage of ggplot’s ability to layer geoms one on top of another, using a different dataset in each case. For the next few chunks of code we’ll return to our earlier method of building up the plot piecemeal, so we can explain what’s happening at each step. The result will be Figure 7.19. We begin by taking the opiates data and creating some small special-purpose tibbles from it. We remove the District of Columbia as we go, as it is not a state.
st_top_div <- opiates |>filter(year ==max(year), st !="DC") |>group_by(division_name) |>slice_max(order_by = adjusted, n =2)opiates20_div <- opiates |>mutate(top = st %in% st_top_div$st) |>filter(year ==2020, st !="DC")labels_div <- opiates |>group_by(division_name) |>summarize(adjusted =max(adjusted, na.rm =TRUE)) |>mutate(facet_var =reorder(division_name, adjusted)) |>drop_na()
What are these objects?
The st_top_div tibble cuts down the opiates dataset to the most recent year, then groups by census division (there are nine of them) and picks the two states within each division that have the highest adjusted death rate in that year. We are going to use this tibble in two ways: first to put a point at the end of our trend line, and second to create a flag in other tibbles that identifies the highest death-rate states.
The opiates20_div tibble has one row per state, with a flag variable, top that is TRUE if a state is one of those in st_top_div and FALSE otherwise. It has all the columns in opiates, but only for 2020. We are going to use this to label the endpoints of all the states.
The labels_div tibble has nine rows, one for each census division, that contains the maximum value of the adjusted death rate within that division. We will use it to label the facets inside the plot area.
Now let’s build the plot.
p0 <- opiates |>filter(st !="DC") |>mutate(top = st %in% st_top_div$st) |>ggplot(aes(x = year, y = adjusted, color = top, group = st)) +geom_line() +geom_point(data = opiates20_div)p0
Figure 7.18: The first step in our layered time series plot.
To make the p0 object we start with opiates, filter out DC, and then create the top flag with mutate(). The call goes down each row of the st column and compares it to the corresponding column in st_top_div, assigning TRUE if there’s a match and FALSE otherwise. Then we set up the ggplot object, putting year on the x-axis, the adjusted death rate on the y-axis, and representing top by color. We also tell ggplot that the opiates data are grouped by state. We add a geom_line() layer and then a geom_point() layer. The geom_point() layer inherits all the same aesthetics as geom_line(), but uses its own dataset. This is the opiates20_div tibble we just created, which only has data for 2020. If we plot this (Figure 7.18), we’ll get a single-panel line graph with a separate line for every state. Each line will have a point at the end, and both will be colored by the top flag.
Next, we add the text labels to the endpoints with geom_repel_label() from the ggrepel package.
In this step, we give geom_label_repel() the opiates20_div tibble as its data and tell it to use the st column as the label. The geom will inherit all the other necessary aesthetic mappings from the p0 object. We then set several options for the geom, mostly to constrain it. We’re using geom_text_repel() in the first place because of its flexibility in placing labels, but we want to stop it from doing some things. The xlim() argument constrains it to only put labels to the right of 2020 on the x-axis. The padding arguments slightly reduce the amount of spacing allowed inside and outside the labels. Finally, “segments” are what ggrepel calls the little lines it draws to link a data point to a label. We want those to be as short as possible and also invisible. So we set min.segment.length to 0 and segment.color to NA. The result, not shown here, is that every state line will be labeled very close to its endpoint, but always to the right of it.
Next, we do something that we could skip if we chose, but which in this case helps the plot enough to be worth it. When we facet a plot, the facet’s labels are shown at the top of each panel. For our plot that will put them quite far away from where the data is in several of the panels. So instead we are going to turn off the strip labels and put a label inside the plot area. We’ll use geom_label() which works the same way that geom_text() does. We tell it to use labels_div for its data but to place every label in the same x (2005) and y (40) location. We also tell it to ignore any other aesthetics it might otherwise inherit, e.g. color which would only confuse it.
If you plot the p2 object you’ll just see one label, because they are all printed right on top of one another.
We’re nearly there. Now we just need to deal with the scales and guides, add the labels, set up the faceting, and make a quick adjustment to the theme.
p2 +scale_x_continuous(expand =expansion(add =c(0, 2))) +scale_color_manual(values =c("gray40", "firebrick")) +guides(color ="none") +labs(x =NULL, y ="Rate per 100,000 population",title ="Opiate Death Rates by State and Census Division, 1999-2014") +facet_wrap(~reorder(division_name, adjusted, max, na.rm =TRUE),nrow =3) +theme(strip.text =element_blank())
The expand argument to scale_x_continuous() creates a little extra room at the right-hand end of the x-axis, to give the state labels a little breathing room. The facet_wrap() call orders the census divisions according to the maximum value of the adjusted death rate in each one. The call to theme() switches off the facet strip labels.
The expansion() function can take both add and multiply parameters to flexibly adjust the scale. This is also how you can force ggplot to make plots start flush with the axes if you wish.
Figure 7.19: The opiate data as a faceted time-series.
The result is shown in Figure 7.19. This plot uses the same data as shown in Figure 7.16 but in quite a different way. It is easier to see what is happening in the geographically smaller states, such as Delaware in the South Atlantic division, or Connecticut in New England. Some things are conveyed in a similar way in both representations, such as the astonishingly rapid rise in West Virginia’s death rate. A time-series plot like this is also better at conveying the diverging trajectories of various states. While that can be inferred from the maps, it is easier to see in the trend plots. At the same time, we also give up some things looking the data this way. The Census Bureau’s allocation of states to divisions tries to preserve geographical relatedness but is also somewhat arbitrary. West Virginia, Kentucky, and Ohio are placed in different boxes, for example. But these are amongst the most affected states, and they share borders with one another. This gets lost in the time series plot. Having to faceting the plot by census division is, in a way, a concession to the costs imposed by trying to visualize fifty lines while keeping them identifiable. We can see that in the labeling. The fact that many rates are quite close makes legibly identifying each state a little tricky. If we were placing these labels by hand we would be able to do better, but our goal has been to see how far we can reasonably get in a reproducible, programmatic fashion.
The data in both figures is the same. Its unit of observation is the state-year. The map emphasizes the “state” aspect, the line-graph the “year” aspect. Beyond that, the limits of any representation are imposed by the unit of observation. In some sense, an “ideal” dataset here would exist at some much more fine-grained level of unit, time, and spatial specificity. Imagine individual-level data with arbitrarily precise information on personal characteristics, time, and location of death. In a case like that, we could then aggregate up to any categorical, spatial, or temporal units we liked. But data like that is rare, often for good reasons that range from practicality of collection to the privacy of individuals. In practice, we need to take care not to commit a kind of fallacy of misplaced concreteness that mistakes the unit of observation for the thing of real substantive or theoretical interest. This is a problem for most kinds of social-scientific data. But their striking visual character, and the fact that spatial units like counties or states do have a substantive social and administrative existence, makes maps perhaps more vulnerable to this problem than other kinds of visualization.
Where to Go Next
In this chapter, we learned how to begin to work with state-level and county-level data. But this barely scratches the surface of visualization where spatial features and distributions are the main focus. The analysis and visualization of spatial data is its own research area, with its own scholarly disciplines in geography and cartography. Concepts and methods for representing spatial features are both well-developed and standardized. Until recently, most of this functionality was accessible only through dedicated Geographical Information Systems. Their mapping and spatial analysis features were not well-connected. Or, at least, they were not conveniently connected to software oriented to the analysis of tabular data.
Things have improved greatly over the past decade. The integration of the Simple Features standard into R has meant that sophisticated spatial analysis is much more accessible now than it once was. The Simple Features package is presented in detail in (Pebesma and Bivand 2023). For an introduction to visualization with R focused on spatial data (and the calculation of spatial statistics) see (Beecham 2026). In the U.S. case, mapping social scientific data depends heavily on the resources provided by the U.S. Census Bureau and the TIGER/Line that allow you to map data for many different geographical, administrative, and Census-related subdivisions of the country. Kyle Walker’s tidycensus package (Walker 2018; Walker and Herman 2017) lets you talk to the Census Bureau’s API in a tidy fashion, including obtaining its shapefiles. The stars and terra packages are gateways to more complex forms of spatial data (Pebesma 2018; Hijmans 2020). Finally, while not a ggplot extension, the tmap package uses a ggplot-like syntax to produce high-quality thematic maps (Tennekes 2014).
Beecham, Roger. 2026. Visualization for Social Data Science. Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences. CRC Press, Taylor & Francis Group. https://doi.org/10.1201/9781003292760.
Fedorov, Valerii, Frank Mannino, and Rongmei Zhang. 2009. “Consequences of Dichotomization.”Pharmaceutical Statistics 8 (1): 50–61. https://doi.org/10.1002/pst.331.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
Van Zwet, Erik W., Frank E. Harrell, and Stephen J. Senn. 2026. “An Empirical Assessment of the Cost of Dichotomization of the Outcome of Clinical Trials.”Statistics in Medicine 45 (3–5): e70402. https://doi.org/10.1002/sim.70402.
Walker, Kyle. 2018. Analyzing the US Census with R. CRC Press.
Walker, Kyle, and Matt Herman. 2017. “Tidycensus: Load US Census Boundary and Attribute Data as ’Tidyverse’ and ’Sf’-Ready Data Frames.” Comprehensive R Archive Network, June 20. https://doi.org/10.32614/CRAN.package.tidycensus.