Michael W. Kearney

Three things to know beyond base R

Tue, 23 Apr 2019 00:00:00 +0000

I think it’s fair to say that most academics who learn about R do so in the process of training or applying quantitative research methods. As a consequence, knowledge of R among academics tends to be limited to core (base) R packages (R Core Team, 2018) and a small handful of speciality statistical packages, e.g., {lavaan}, {lme4}, {MASS}, {car}, etc. With this in mind, the goal of this post is to provide an overview of three things to know beyond base R.

A more appropriate title for this post might be “A quick introduction to the tidyverse,” because all three things come from the {tidyverse}, a collection of high-powered, consistent, and easy-to-use packages developed by a number of thoughtful and talented R developers. This post should really be considered an exceptionally brief introduction to parts of the tidyverse, but users don’t need to know everything about the tidyverse to reap its benefits. If you’re interested in a more formal and thorough introduction, I would strongly encourage you to check out R for Data Science by Garrett Grolemund and Hadley Wickham.

For those of you who still might be hesitant about moving beyond base R, consider the plot below, which shows the download counts of {tidyverse} packages compared to several well-known and highly-regarded statistical packages.

This plot hopefully demonstrates two things. First, that a lot of people use (and therefore test, troubleshoot, and write documentation for) tidyverse packages. Second, that use of tidyverse packages is not merely a fad or momentary trend. Indeed, compared to widely used statistical packages, there is a considerably higher download rate among tidyverse packages (even above and beyond the general uptick in overall R usage).

1. The pipe

The first thing to know beyond base R is the pipe. The pipe refers to the %>% operator from the {magrittr} package.

library(magrittr)

It may seem complicated at first, but what the pipe does is actually quite simple. That is, it allows users to write linear code. To illustrate use of the pipe, consider the following code that takes the mean of the log of three numbers:

mean(log(c(1, 3, 9)))
#> [1] 1.098612

Notice how the numbers c(1, 3, 9) are nested inside log(), which is then nested inside mean()? If you’re reading the code from left-to-right, it means the functions are performed in reverse order from how they are written. If we broke the code down into its three functions, we would actually expect the order of operations to proceed as follows:

Concatenate numbers into vector c(...)
Log the numeric vector log(...)
Estimate the mean of the logged numeric vector mean(...)

With this order in mind, now consider the following piped code, which takes a numeric vector c(1, 3, 9), calculates the log(), and then estimates the mean(). Hopefully you notice that, in contrast to the nested code above, the code below is linear; in other words, the code appears in the same order (moving from left to right) as the operations are performed.

c(1, 3, 9) %>% log() %>% mean()
#> [1] 1.098612

As a convention designed to make piped code even easier to read, users are encouraged to place each piped statement on its own line. So the code above should be rewritten as follows:

c(1, 3, 9) %>%
  log() %>%
  mean()
#> [1] 1.098612
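
The pipe passes the left-hand side as the first argument of the function on its right, so additional arguments can still be supplied as usual. The sketch below illustrates this with log(base = 3), a base chosen here only because it makes the arithmetic easy to check, and with magrittr’s . placeholder for piping a value into an argument other than the first. The object names (piped, nested, placed) are made up for this example.

```r
library(magrittr)

## the piped value becomes the first argument; other arguments are passed as usual
piped <- c(1, 3, 9) %>%
  log(base = 3) %>%
  mean()

## the equivalent nested call
nested <- mean(log(c(1, 3, 9), base = 3))

## the `.` placeholder pipes a value into an argument other than the first
placed <- 3 %>%
  log(c(1, 3, 9), base = .) %>%
  mean()

c(piped = piped, nested = nested, placed = placed)
```

All three expressions compute the same quantity; only the order in which the code reads differs.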

2. The tibble

R is my favorite programming language because nearly everyone who uses it either works with data frames or is extremely familiar with them. With all of the use and attention, it is hardly surprising that there would be some improvements to the traditional data.frame, which is why the tibble is the second thing to know beyond base R. The tibble refers to a data frame-like class produced by the {tibble} package. Tibbles (class tbl_df) are essentially a special variant of data frames that have desirable properties for printing and joining. And because they also inherit the data.frame class, they behave like data frames 99.9% of the time (and, seriously, I wouldn’t worry about the other 0.1%).

As the code illustrates below, it’s easy to convert nearly any data frame into a tibble via tibble::as_tibble():

(mtcars <- tibble::as_tibble(mtcars))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> # … with 27 more rows

It’s also possible to create tibbles directly via tibble::tibble(...), e.g.,

tibble::tibble(
  x = rnorm(100),
  y = rnorm(100),
  z = sample(letters, 100, replace = TRUE)
)
#> # A tibble: 100 x 3
#>       x      y z    
#>   <dbl>  <dbl> <chr>
#> 1 0.138  0.851 u    
#> 2 0.894 -0.295 m    
#> 3 0.377  1.61  u    
#> 4 0.345 -0.332 r    
#> 5 0.508  1.24  m    
#> # … with 95 more rows

You hopefully noticed in the printing of the two previous code chunks that tibbles print out a lot prettier than normal data frames. Each observation is limited to a single line (no horizontal scrolling or wrapping). Not all rows are printed by default. And the printout also includes meta information about the classes of variables and the number of rows and columns in the data set.

If you were really paying attention, you may have also noticed the z variable in the tibble built from scratch was stored (by default) as a character vector and not a factor. This is another important difference in tibbles compared to data frames. Tibbles are lazy, which in this case is useful for avoiding join or mutate errors later on related to a limited set of observed factor levels.
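
The difference in string handling can be checked directly. In the sketch below, stringsAsFactors = TRUE is set explicitly because it was the data.frame() default when this post was written but has been FALSE since R 4.0.0. The last two lines show one of the few places where tibbles and data frames behave differently: selecting a single column with [ . The objects df and tb are made up for this example.

```r
library(tibble)

## stringsAsFactors = TRUE is set explicitly here; it was the data.frame()
## default before R 4.0.0 and is FALSE from R 4.0.0 onward
df <- data.frame(z = c("a", "b", "a"), stringsAsFactors = TRUE)
tb <- tibble(z = c("a", "b", "a"))

## strings become a factor in the data frame but stay character in the tibble
class(df$z)
class(tb$z)

## selecting one column with [ drops a data frame to a vector but keeps a tibble
class(df[, "z"])
class(tb[, "z"])
```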

3. Select, filter, arrange, and mutate

If I could only use one package beyond base R, it’d probably be {dplyr}, which is why key dplyr functions are the third thing to learn beyond base R. Compared to base R, the beauty of these {dplyr} functions is that they feature consistent design principles, easily work with non-standard evaluation (i.e., you don’t have to put quotes around variable names), and even leverage C++ behind the scenes for improved performance.

{dplyr} has tons of useful features, but its fundamental building blocks allow users to select columns,

dplyr::select(mtcars, cyl, wt, mpg, gear)
#> # A tibble: 32 x 4
#>     cyl    wt   mpg  gear
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     6  2.62  21       4
#> 2     6  2.88  21       4
#> 3     4  2.32  22.8     4
#> 4     6  3.22  21.4     3
#> 5     8  3.44  18.7     3
#> # … with 27 more rows

mutate columns (i.e., transform, add),

mtcars %>%
  dplyr::mutate(mpg_per_cyl = mpg / cyl,
    car = row.names(datasets::mtcars)) %>%
  dplyr::select(car, mpg, cyl, mpg_per_cyl)
#> # A tibble: 32 x 4
#>   car                 mpg   cyl mpg_per_cyl
#>   <chr>             <dbl> <dbl>       <dbl>
#> 1 Mazda RX4          21       6        3.5 
#> 2 Mazda RX4 Wag      21       6        3.5 
#> 3 Datsun 710         22.8     4        5.7 
#> 4 Hornet 4 Drive     21.4     6        3.57
#> 5 Hornet Sportabout  18.7     8        2.34
#> # … with 27 more rows

filter rows,

dplyr::filter(mtcars, cyl == 4, mpg >= 10)
#> # A tibble: 11 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> # … with 6 more rows

and arrange rows.

dplyr::arrange(mtcars, dplyr::desc(mpg))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> # … with 27 more rows

Chain these {dplyr} functions together with the pipe operator %>% to achieve linear/tidyverse-style coding excellence.

mtcars %>%
  dplyr::filter(gear == 4) %>%
  dplyr::mutate(wt_mpg = wt / mpg) %>%
  dplyr::select(cyl, wt, mpg, wt_mpg) %>%
  dplyr::arrange(wt_mpg)
#> # A tibble: 12 x 4
#>     cyl    wt   mpg wt_mpg
#>   <dbl> <dbl> <dbl>  <dbl>
#> 1     4  1.62  30.4 0.0531
#> 2     4  1.84  33.9 0.0541
#> 3     4  2.2   32.4 0.0679
#> 4     4  1.94  27.3 0.0709
#> 5     4  2.32  22.8 0.102 
#> # … with 7 more rows
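
For comparison, here is a rough base R equivalent of the piped chain above. It is only a sketch for contrast, using a copy of datasets::mtcars named d (a name made up for this example), and it shows how the data frame name has to be repeated at each step.

```r
## work on a copy of the original data frame
d <- datasets::mtcars

## filter rows
d <- d[d$gear == 4, ]

## mutate (add) a column
d$wt_mpg <- d$wt / d$mpg

## arrange rows and select columns
d <- d[order(d$wt_mpg), c("cyl", "wt", "mpg", "wt_mpg")]

head(d, 5)
```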

Faster code with Rcpp

Wed, 10 Apr 2019 00:00:00 +0000

Recently I was asked if I could add to {rtweet} some basic functions for converting Twitter data into network data objects. I thought this was a reasonable request and a good opportunity for me to learn more about network analysis. But the task of converting Twitter data into network-friendly objects is something that has, at least for me, been really slow and inefficient. So, for the past several weeks, I’ve been slowly working toward what I believe is a simple but efficient solution. Hence, the purpose of this blog post is to document what I’ve done.

The problem

The ultimate task at issue is converting Twitter data¹ into a network or network-friendly data object. Thus, the immediate problem is quickly and efficiently unrolling the connections (e.g., mentions) from one user to zero or more other users. In other words, the problem is figuring out how to convert this recursive data frame:

#> # A tibble: 163 x 2
#>    user_id mentions_user_id
#>    <chr>   <list>          
#>  1 5685812 <chr [1]>       
#>  2 5685812 <chr [2]>       
#>  3 5685812 <chr [1]>       
#>  4 5685812 <chr [1]>       
#>  5 5685812 <chr [2]>       
#>  6 5685812 <chr [1]>       
#>  7 5685812 <chr [1]>       
#>  8 5685812 <chr [1]>       
#>  9 5685812 <chr [1]>       
#> 10 5685812 <chr [1]>       
#> # … with 153 more rows

into a desired output (with from and to-like columns) that looks something like this:

tibble::as_tibble(unroll_connections2(d))
#> # A tibble: 220 x 2
#>    from    to                 
#>    <chr>   <chr>              
#>  1 5685812 2973406683         
#>  2 5685812 215035672          
#>  3 5685812 1051975721885798402
#>  4 5685812 1015516068717170688
#>  5 5685812 2801252524         
#>  6 5685812 15184835           
#>  7 5685812 260399941          
#>  8 5685812 17581779           
#>  9 5685812 870078805381132288 
#> 10 5685812 4069028055         
#> # … with 210 more rows
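
To make the shape of the problem concrete, here is a toy version of it. The IDs and the names d_toy and edges are invented for this sketch, and the approach shown (repeat each user ID once per mention, then flatten) is only meant to illustrate the target output. Note that it drops every missing mention, which is a slight simplification of the functions described below.

```r
## toy stand-in for the Twitter data (IDs are invented)
d_toy <- data.frame(user_id = c("u1", "u2", "u3"), stringsAsFactors = FALSE)
d_toy$mentions_user_id <- list(c("u2", "u3"), NA_character_, "u1")

## repeat each 'from' once per mention, flatten 'to', then drop missing mentions
edges <- data.frame(
  from = rep(d_toy$user_id, times = lengths(d_toy$mentions_user_id)),
  to = unlist(d_toy$mentions_user_id, use.names = FALSE),
  stringsAsFactors = FALSE
)
edges <- edges[!is.na(edges$to), ]

edges
```

Three input rows become three edges here: two from u1, none from u2 (whose only mention is missing), and one from u3.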

Pure R code (slowest)

The first function I wrote to accomplish this task leveraged data.frame logic (each column should be the same length) to coerce the from column (user_id) to be of equal length as the to (mentions_user_id) column for each row of the input data set. It then collapses everything into a single data frame.

unroll_connections1 <- function(.x) {
  fun <- function(from, to) {
    ## if NULL or 1 missing value then return empty data frame
    if (length(to) == 0 || (length(to) == 1 && is.na(to[1]))) {
      return(data.frame())
    }
    ## return as data frame
    data.frame(from = from, to = unlist(to, use.names = FALSE),
      stringsAsFactors = FALSE)
  }
  ## SIMPLIFY = FALSE keeps the result a list of data frames for rbind()
  .x <- mapply(fun, .x[[1]], .x[[2]], SIMPLIFY = FALSE, USE.NAMES = FALSE)
  do.call(rbind, .x)
}

The above code is slow and inefficient because it calls data.frame() (and all its associated baggage) for every row of the input data.

Pure R code (faster)

My next iteration was also written in pure R code. To minimize the effect of so many data.frame() calls, the function below calculates the number of times it needs to repeat the from value (to match the number of times to values appear) and then combines everything at the end into a data frame. As the benchmarking results later on confirm, this function offers a sizable speed up over the original, data.frame()-heavy function!

unroll_connections2 <- function(x) {
  ## initialize logical (TRUE) vector
  kp <- !logical(nrow(x))

  ## measure [and record] length of each 'to' field (list of character vector)
  n <- lengths(x[[2]])
  n1 <- which(n == 1)

  ## if length == 1 & is.na(x[1])
  kp[n1[vapply(x[[2]][n1], is.na, logical(1))]] <- FALSE

  ## create 'from' and 'to' vectors
  from <- unlist(mapply(rep, x[[1]][kp], n[kp]), use.names = FALSE)
  to <- unlist(x[[2]][kp], use.names = FALSE)

  ## return as data frame
  data.frame(
    from = from,
    to = to,
    stringsAsFactors = FALSE
  )
}

Rcpp code (fastest)

I was happy with the large speed up from unroll_connections2(), but I’ve also been trying to learn how to speed up my code with C++, so I decided to see what kind of additional speed up I could get via {Rcpp}.

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
List unroll_connections3(CharacterVector from, std::vector<std::vector<std::string>> to) {
  //# set size parameters (exclude NAs from the 'to'-based output count)
  const int n = from.size();
  int len = 0;
  for (int i = 0; i < n; i++) {
    if (!to[i].empty() && to[i][0] != "NA") {
      len += to[i].size();
    }
  }
  //# use calculated lengths to initialize output character vectors 
  CharacterVector from2(len);
  CharacterVector to2(len);

  //# for each value of the 'from' vector, create appropriately re-sized from2 
  //# and to2 vectors
  int ctr = 0;
  for (int i = 0; i < n; i++) {
    int nn = to[i].size();
    for (int j = 0; j < nn; j++) {
      if (j == 0) {
        if (to[i][j] != "NA") {
          from2[ctr] = from[i];
          to2[ctr] = to[i][j];
          ctr += 1;
        }
      } else {
        from2[ctr] = from[i];
        to2[ctr] = to[i][j];
        ctr += 1;
      }
    }
  }
  //# combine the new [flat] vectors into a data frame (requires row names)
  List df = List::create(_["from"] = from2, _["to"] = to2);
  df.attr("class") = "data.frame";
  df.attr("row.names") = seq(1, ctr);
  return df;
}
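
For readers who have not compiled C++ from R before, the sketch below shows one way to try it with Rcpp::cppFunction(). It assumes {Rcpp} and a working C++ toolchain are installed. The helper count_mentions() is hypothetical and written only for this illustration (it is not part of {rtweet} or of the function above); it counts non-missing mentions, mirroring the first loop of unroll_connections3(). Code saved in a .cpp file, like the function above, can instead be compiled with Rcpp::sourceCpp().

```r
library(Rcpp)

## compile a small C++ function and expose it to R
## count_mentions() is a made-up helper for illustration only
cppFunction('
int count_mentions(List to) {
  int len = 0;
  for (int i = 0; i < to.size(); i++) {
    CharacterVector v = to[i];
    for (int j = 0; j < v.size(); j++) {
      // test for missing values directly
      if (!CharacterVector::is_na(v[j])) len++;
    }
  }
  return len;
}')

## toy list of mentions (IDs are invented)
to <- list(c("u2", "u3"), NA_character_, "u1")
n_edges <- count_mentions(to)
n_edges
```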

bench::mark()

To compare the three previously described functions, I’ve used the {bench} package. The code and numeric results are printed below.

## from and to vectors
from <- d$user_id
to <- d$mentions_user_id

## perform bench mark
m <- bench::mark(
  unroll_connections1 = unroll_connections1(d),
  unroll_connections2 = unroll_connections2(d),
  unroll_connections3 = unroll_connections3(from, to),
  relative = TRUE,
  min_iterations = 100
)

## print results
m %>%
  dplyr::select(expression:n_gc) %>%
  knitr::kable(digits = 2)

expression              min    mean  median    max  itr/sec  mem_alloc  n_gc
unroll_connections1  472.14  462.66  460.84  51.76     1.00      17.78    21
unroll_connections2    7.19    7.08    6.85   1.00    65.32     631.74     6
unroll_connections3    1.00    1.00    1.00   1.26   462.66       1.00     1

As you can see, the initial improvement from unroll_connections1() to unroll_connections2() was more than 60X, which is great. But, thanks to the power of {Rcpp}, I was able to speed things up even more, with an improvement from unroll_connections2() to unroll_connections3() of roughly 7X, or roughly 450X compared to the original function!
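
Because relative = TRUE scales each timing column so that the fastest function equals 1, the speed-ups can be read off as ratios. The sketch below simply redoes that arithmetic with the median column copied from the table; the object names are made up for this example.

```r
## relative medians copied from the table above (fastest function = 1)
rel_median <- c(
  unroll_connections1 = 460.84,
  unroll_connections2 = 6.85,
  unroll_connections3 = 1.00
)

## speed-up from the first to the second pure R function
speedup_1_to_2 <- rel_median[["unroll_connections1"]] / rel_median[["unroll_connections2"]]

## speed-up from the second pure R function to the Rcpp function
speedup_2_to_3 <- rel_median[["unroll_connections2"]] / rel_median[["unroll_connections3"]]

## overall speed-up
speedup_1_to_3 <- rel_median[["unroll_connections1"]] / rel_median[["unroll_connections3"]]

c(speedup_1_to_2, speedup_2_to_3, speedup_1_to_3)
```

By the median, that works out to roughly 67X for the first step and just under 7X for the second.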

## plot
m$expression <- factor(m$expression, levels = rev(m$expression))
p <- bench:::autoplot.bench_mark(m, shape = 21, size = 2.5, color = "#333333aa") +
  ggplot2::aes(fill = expression) +
  dataviz::theme_mwk() +
  ggplot2::theme(legend.position = "none",
    plot.caption = ggplot2::element_text(family = "Roboto Condensed")) +
  ggplot2::labs(x = NULL, y = "Time (mean task completion)",
    title = "Benchmarking Twitter-to-network data wrangling functions",
    subtitle = "Comparing base R and Rcpp functions for converting Twitter data into network-friendly objects",
    caption = "unroll_connections1() and unroll_connections2() are written in base R; unroll_connections3() uses Rcpp")

## save the plot (ggsave() is called on its own; recent versions of ggplot2
## no longer allow adding it to a plot with +)
ggplot2::ggsave(here::here("content", "post", "img", "network-benchmark.png"),
  plot = p, width = 7, height = 4, units = "in")

Notes

¹ Data I used to generate the example data set:

## search for up to 200 #rstats tweets from verified users
rt <- rtweet::search_tweets("#rstats filter:verified", n = 200)

## select only the node (ID/screen name) variables
d <- dplyr::select(rt, user_id, mentions_user_id)

Installing R and Rstudio

Fri, 19 Oct 2018 00:00:00 +0000

This post describes how to download and perform a basic local install of R and Rstudio. The instructions should work for both macOS and Windows users. Although not required, installation tends to work best when operating systems are up-to-date. At the time of writing, this means R/Rstudio work best with macOS High Sierra and Windows 10.

R vs Rstudio

R is a statistical computing language/environment. It is distinct from Rstudio, which is an integrated development environment (IDE) or high-powered graphical user interface (GUI) optimized for working with the R language. In other words, R is the engine, and Rstudio is the interface. Consequently, you’ll need to install both R and Rstudio.

Download and install R

Use the following instructions to download and install the R statistical computing language/environment:

Go to the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/
Click on the appropriate operating system (Mac or Windows) to navigate to the download page.
Download the most recent version of R.
- If Mac, select the first .pkg file listed in the “files” section. At time of writing, this is version R-3.5.1.pkg.
- If Windows, select the bold, underlined link written as ‘install R for the first time’.
Double click (run) the downloaded file (check your Downloads folder). Click yes through prompts to install like any other program (default values should be okay). You may get a warning about the source of the download being unknown. Do whatever you can to allow the installation to continue; I promise the R pkg file is safe!

Download and install Rstudio

Rstudio is an integrated development environment (IDE) that makes it easy to use R. Once both R and Rstudio are installed, I’d recommend ignoring the actual “R” program and instead only opening and using Rstudio, which will automatically call and allow interactive use of R.

Go to the free download location on Rstudio’s website: https://www.rstudio.com/products/rstudio/download/#download
Select one of the highlighted options that corresponds with your computer’s operating system (Mac or PC)
Double click (run) the downloaded file (check your Downloads folder). Click yes through prompts to install like any other program (the defaults should be fine). You may get a warning about the source of the download being unknown. Do whatever you can to allow the installation to continue; I promise the Rstudio installer is safe!

For an actual demonstration of installing R and Rstudio using these instructions, see the appropriate video for your operating system below.

Download and install R & Rstudio on a Mac

Download and install R & Rstudio on a PC (Windows)

Using R/Rstudio

You should be able to find the Rstudio application in your computer’s Application or Program folder. Alternatively, a simple search for “Rstudio” using finder/spotlight or the Windows key should be able to locate “Rstudio” on your machine.
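
Once Rstudio opens, one quick way to confirm that it found your R installation is to run a few commands in the Console pane. This is only a suggested check, not part of the installation itself; the version compared against is the one mentioned earlier in this post.

```r
## print the version of R that Rstudio is using
R.version.string

## TRUE if the running R is at least the version mentioned above
getRversion() >= "3.5.1"

## a first calculation to confirm the console responds
1 + 1
```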

My R-bloggers post

Thu, 18 Oct 2018 00:00:00 +0000

I have long been a fan of R-bloggers, a content aggregating site focused on blog posts about R. It serves a useful purpose¹ and has considerable reach.² But in the first version of this blog post, I actually wrote a lengthy critique of the site where I concluded with a not-so-blunt suggestion that R-bloggers wasn’t as good as it should be. In retrospect, and after a pleasant exchange about a draft of the post with Tal Galili (the creator and operator of R-bloggers), I can confidently say my post was overly nit-picky and unrealistic in my expectations for a benevolent blog-aggregating site like R-bloggers.

Background on R-bloggers

As I already mentioned, R-bloggers is an R-related content aggregating site that circulates and indexes blog posts about R. It was created, as far as I can tell, in 2005 by Tal Galili, who is, impressively, still listed as the sole maintainer of the site–though it also appears to be affiliated with the Foundation for Open Access Statistics (FOAS), so, hopefully, they provide Tal with some support.

As for its mission and a description of its basic operations, here’s the explanation straight from R-bloggers’ about section:

Figure: Screen shot of R-bloggers description/operation

And here’s the explanation of what R-bloggers offers to bloggers:

Figure: Screen shot of R-bloggers contribution description

Anyone interested in adding their blog to the R-bloggers feed is also provided with a link containing instructions and a submission form for adding a blog to R-bloggers. The guidelines for bloggers are quite reasonable–blog posts should be about R, include a minimum amount of well-written non-code content (i.e., code snippets are not discouraged, but they should be accompanied by text), contain reasonably reproducible examples/use cases (if relevant), compatible HTML code, and a link back to R-bloggers, etc.

R-bloggers’ contribution to #rstats

Ultimately, the contribution made by R-bloggers is not necessarily the production of content, but the dissemination of it. When one considers some of the difficulties associated with running or automating this kind of service, it’s easy to understand why the dissemination of R-related content is a valuable contribution. But it’s perhaps even easier to understand the value of its contribution by actually trying to automate the process yourself…

So, in addition to qualifying as my R-bloggers link, the goal of this post is to create via automation a content-aggregating R-bloggers-like website.

Identifying blogs

Without scraping the R-bloggers website, I was able to accumulate a large list of R-related blogs by searching for tweets via rtweet containing R-related keywords and URLs that matched at least one of two common blog post conventions (/post/ or 2018/\\d{2}/).

## build search query with URL filters
m <- substr(Sys.Date(), 6, 7)
sq <- glue::glue(
  '(rstats OR tidyverse OR "R package") (url:post OR url:2018/{m})')

## search for most recent 100 matching tweets
rt <- rtweet::search_tweets(sq, n = 100)

## print URLs
rt %>%
  dplyr::pull(urls_expanded_url) %>%
  unlist() %>%
  tfse::na_omit() %>%
  unique()
#> [1] "https://figshare.com/articles/RCCPII_Data/7928480"                     
#> [2] "http://thug-r.life/post/2019-04-03-tale-of-three-assignment-operators/"
#> [3] "https://buff.ly/2UHhohr"                                               
#> [4] "http://bit.ly/learning-lab-07"                                         
#> [5] "http://bit.ly/lstm-time-series"                                        
#> [6] "https://tenet-rccpii.github.io/rccpii-2018/"                           
#> [7] "https://carpentries.org/blog/2019/04/rccpii/"                          
#> [8] "https://nemethc.com/post/2019-04-05-seattle-bike-trafic/"
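
As the output shows, the search also returns shortened links and non-blog URLs, so a second, local filter can help. The sketch below applies a regular expression to a few of the URLs printed above. The pattern is an assumption made for this example: it generalizes the 2018/\\d{2}/ convention to any four-digit year, and the names urls, pattern, and blog_urls are made up.

```r
## a few of the URLs returned by the search above
urls <- c(
  "https://figshare.com/articles/RCCPII_Data/7928480",
  "http://thug-r.life/post/2019-04-03-tale-of-three-assignment-operators/",
  "https://buff.ly/2UHhohr",
  "https://carpentries.org/blog/2019/04/rccpii/"
)

## match either "/post/" or a "/YYYY/MM/" path segment
pattern <- "/post/|/\\d{4}/\\d{2}/"

## keep only URLs that follow one of the two blog post conventions
blog_urls <- urls[grepl(pattern, urls)]
blog_urls
```

Shortened links (buff.ly, bit.ly) would need to be expanded before a filter like this could classify them.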

Automating a website

Then, after using rvest to extract post information and text previews from RSS feeds, I was able to automate, with the help of blogdown, a continuously updating website with a feed containing linked post previews. It took a good amount of elbow grease, but, at least initially, the task seemed surprisingly doable. I was so confident in my ability to automate an R-bloggers-like website, I even decided to expand the aim of my content aggregating feed to be about data-science generally–hence, the name, data-scribers. You can see the site for yourself at data-scribers.mikewk.com.
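
The feed-parsing step can be sketched with {xml2}, the XML package that {rvest} builds on. To be clear, this is not the code behind data-scribers: the feed text, the example.com links, and the names rss, doc, items, and feed are all invented for illustration. A real version would read each blog’s feed URL instead of an inline string.

```r
library(xml2)

## a tiny, made-up RSS feed stored as a string
rss <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/post/first/</link>
      <description>Preview text for the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/post/second/</link>
      <description>Preview text for the second post.</description>
    </item>
  </channel>
</rss>'

## parse the feed and locate each post
doc <- read_xml(rss)
items <- xml_find_all(doc, "//item")

## extract one field per post into a data frame of post previews
feed <- data.frame(
  title = xml_text(xml_find_first(items, "./title")),
  link = xml_text(xml_find_first(items, "./link")),
  preview = xml_text(xml_find_first(items, "./description")),
  stringsAsFactors = FALSE
)
feed
```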

No match for R-bloggers

While the initial outcome appeared to be a resounding success, it wasn’t long before I realized the difficulties involved in pruning (e.g., cutting off text, including/not including code chunks, dealing with inconsistent formats, etc.), filtering (e.g., on-topic, non-trivial, consistent language, original content, non-reposts, etc.), and maintaining (checking/updating/editing algorithm, tagging posts, creating searchable and/or organized archive, etc.) a site like R-bloggers would be a lot of work. In fact, since launching data-scribers, the site has started to take on a life of its own; the range of topics and languages keeps growing, and, at this point, I’m more interested to see where it goes than I am in investing additional time ensuring the feed only contains English posts, filtering via some overly-strict definition of data-science, battling with HTML formatting issues, etc.

Of course, even if I had time to prune, filter, and maintain the feed of post previews, to truly be competitive with R-bloggers, I’d also have to add numerous other features (visuals, linked tags, search bar, etc.)–and even then the site wouldn’t include any integration with R-related advertisement opportunities or job postings.

Notes

¹ R-bloggers is a centralized directory of “over 750” R-related blogs

² At the time of writing, the site has 50k email subscribers, 60k+ Twitter followers, etc.

Labelling dataviz

Thu, 20 Sep 2018 00:00:00 +0000

I still remember how hard it was to learn {ggplot2} after only knowing a little about R¹. Sure, the plots seemed pretty. But compared to the ways I had used R before, {ggplot2}’s syntax seemed almost counter-intuitive. Its pipe-like + workflow, building plots layer by layer, was like nothing I had ever used before. Not to mention, I was unfamiliar with central terms of art like “geoms” and “aesthetics”.

But then again…the plots were really pretty.

Fortunately for me, being able to generate pretty plots was a powerful motivator. Because not long after committing myself to learning how to {ggplot2}, I realized why everyone likes it so much: it’s actually really easy! Once I learned about the key building blocks (ggplot(), aes(), and geom_.*()), I could create pretty plots for all sorts of data types and relationships.

It’s in the details

Over time my #dataviz has gotten a lot better, but it’s had very little to do with the actual plotting of data points ({ggplot2} outputs beautiful plots by default). Instead, my dataviz has improved because I learned how to (a) more effectively label scales, data points, and other dimensions of a plot and (b) (re)size and save high-resolution plots using nice-looking fonts.

With this in mind, my goal with this post is to demonstrate how data visualizations can be improved via proper labelling. And since this idea was inspired by my last post, I will extend the example about the relationship between miles per gallon and number of cylinders. If you read the setup section from the last post, you can skip ahead (it’s the same).

Setup

To follow along with the examples in this post, you will need to load the {tidyverse} set of packages and define a couple stylistic functions used throughout to make the plots even prettier.

## load tidyverse
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.0.0.9000     ✔ purrr   0.2.5     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.6     
#> ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

## create style theme
my_theme <- function() {
  theme_minimal(base_family = "Roboto Condensed") + 
    theme(plot.title = element_text(size = rel(1.5), face = "bold"), 
      plot.subtitle = element_text(size = rel(1.1)),
      plot.caption = element_text(color = "#777777", vjust = 0),
      axis.title = element_text(size = rel(.9), hjust = 0.95, face = "italic"),
      panel.grid.major = element_line(size = rel(.1), color = "#000000"), 
      panel.grid.minor = element_line(size = rel(.05), color = "#000000"), 
      legend.position = "none")
}
my_labs <- function() {
  labs(title = "Average miles per gallon by number of cylinders", 
    subtitle = "Scatter plot depicting average miles per gallon aggregated by number of cylinders",
    x = "Number of cylinders", y = "Miles per gallon",
    caption = "Source: Estimates calculated from the 'mtcars' data set")
}
my_save <- function(file) {
  ggsave(file, width = 7, height = 4.5, units = "in")
}

The data set featured in this post is mtcars, which is bundled as part of the core datasets package. Specifically, examples will feature the mpg (miles per gallon) and cyl (number of cylinders) variables.

## print first six rows
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Labelling dataviz

I think most would agree a good data visualization clearly conveys a pattern (or lack of pattern) while being easy to understand, while a great data visualization conveys a pattern (or lack of pattern) while being easy to understand and aesthetically pleasing. The difference between good and great can be something as minor as color palette, but, in my experience, more often than not the only difference between a good visualization and a great visualization is labelling.

In my last post, for example, the first successful plot of mpg by cyl was only okay–it’s a little bland and it uses an actual expression for an axis title.

But then I replaced the expression and added a custom theme and a few more labels, and I think it started to border on being good.

The combination of style changes and labels clearly made a big difference but, still, I don’t think the above plot is mind-blowing or overly impressive.

Since there aren’t that many data points, I think this visualization can be further improved–with the help of {ggrepel}–by labelling the individual data points–either as an additional layer or as a standalone plot (I didn’t think the summarized cyl estimates added much so I dropped the mean line/points).

## - add row names as make variable
## - add noise to cyl for spacing (store as cyl2)
## - plot and format labels with ggrepel
## - adjust x-axis labels
## - specify custom fill colors
mtcars %>%
  mutate(make = row.names(mtcars),
    cyl2 = case_when(
      cyl == 4 ~ cyl - runif(1, .25, .5),
      cyl == 6 ~ cyl - runif(1, .00, .1),
      cyl == 8 ~ cyl + runif(1, .75, 1.25), 
      TRUE ~ cyl
    )) %>%
  ggplot(aes(x = cyl2, y = mpg)) + 
  ggrepel::geom_label_repel(aes(fill = factor(cyl), label = make), 
    family = "Roboto Condensed Light", label.padding = 0.2, label.size = .25, 
    min.segment.length = 100, color = "black", size = 3.4) + 
  my_theme() + 
  my_labs() + 
  scale_x_continuous(breaks = c(4, 6, 8)) + 
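  scale_fill_manual(values = c("#efd0ef", "#d0efd0", "#d0daef"))

## save the plot that was just printed (ggsave() defaults to the last plot);
## recent versions of ggplot2 no longer allow adding ggsave() with +
my_save("img/tick-marks-final.png")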

As you can see in the code chunk above, I also added some additional noise to the cyl variable to help out {ggrepel}’s spacing algorithm. The approach made it possible to plot and label each car in the data set without overloading or distracting the image with too much information. So, now, not only does the image convey the pattern between mpg and cyl, but it does so in a way that more people can recognize (4-cylinders is less meaningful than Honda Civic, for example), while arguably being even more visually pleasing.
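
One detail of the noise step is worth spelling out: each runif(1, ...) call returns a single number, so every car in a cylinder group is shifted by the same offset and {ggrepel} does the rest of the spacing. The sketch below checks this and, as a hypothetical variant that is not used in the plot above, shows how runif(n(), ...) would instead draw a separate offset for every car. The seed and the names shifted and jittered are made up for this example.

```r
library(dplyr)

## fix the random seed so the offsets are reproducible
set.seed(1)

## same approach as above: one shared offset per cylinder group
shifted <- datasets::mtcars %>%
  mutate(cyl2 = case_when(
    cyl == 4 ~ cyl - runif(1, .25, .5),
    cyl == 6 ~ cyl - runif(1, .00, .1),
    cyl == 8 ~ cyl + runif(1, .75, 1.25),
    TRUE ~ cyl
  ))

## one distinct cyl2 value for each value of cyl
distinct(shifted, cyl, cyl2)

## hypothetical variant: a separate offset for every row
jittered <- datasets::mtcars %>%
  mutate(cyl2 = cyl + runif(n(), -.25, .25))
```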

Notes

¹ I knew just enough to read in data, do some structural equation modeling, and generate some simple plots via plot() and hist().

Tick marks, variable names, and ggplot2

Mon, 17 Sep 2018 00:00:00 +0000

A popular workflow in R uses {dplyr} to group_by() and then summarise()¹ variables. It’s an intuitive and easy way to aggregate and describe data, especially along multiple dimensions. The cost of being both powerful and user-friendly, however, is its arguably inconvenient default method for assigning names to summarized values. As the code illustrates below, users can provide their own names when using summarize().

## explicitly named summarize variable
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg))
#> # A tibble: 3 x 2
#>     cyl   mpg
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1

But when users don’t explicitly name the summarized values, instead of inheriting the name of a summarized variable (in this case mpg), variables are named–by default–with the text of the expression used to create the summarized value.

For example, the code below summarizes by estimating the mean mpg for cars grouped by number of cyl. The code is fairly straightforward, and you can probably see why users often assume the returned summarized data would contain two variables, cyl and mpg.

## unnamed summarize variable
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg))
#> # A tibble: 3 x 2
#>     cyl `mean(mpg)`
#>   <dbl>       <dbl>
#> 1     4        26.7
#> 2     6        19.7
#> 3     8        15.1

But as you can see, the variable names wind up being cyl and mean(mpg) instead of simply cyl and mpg. This default behavior may seem obnoxious at first, but it makes a lot of sense when you think about summarize() values calculated from two or more variables, where no single variable name could describe the result.
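To make the two-or-more-variables point concrete, here is a minimal sketch, assuming {dplyr} is installed. The second summary combines mpg and wt, so neither variable name alone would be an accurate label; the object name out is illustrative.

```r
library(dplyr)

## two unnamed summaries; the second one uses two variables
out <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg), mean(mpg / wt))

## the names are built from the expressions, not from mpg or wt
names(out)
```

Naming the result explicitly, e.g., summarize(mpg_per_wt = mean(mpg / wt)), avoids the question altogether.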

Regardless, while it’s definitely a good idea to provide your own summary variable names, you will inevitably find yourself in a situation where you would like to plot summarized variables that were named using the text of the expressions used to create them.

Thus, my goal with this post is to identify three common mistakes users make when attempting to map variables from dplyr::summarize() to aesthetic dimensions of a plot with {ggplot2} and conclude by describing a solution.

Setup

To follow along with the examples in this post, you will need to load the {tidyverse} set of packages and define a couple of stylistic functions used throughout to make the plots even prettier.

## load tidyverse
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.0.0.9000     ✔ purrr   0.2.5     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.6     
#> ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

## create style theme
my_theme <- function() {
  theme_minimal(base_family = "Roboto Condensed") + 
    theme(plot.title = element_text(size = rel(1.5), face = "bold"), 
      plot.subtitle = element_text(size = rel(1.1)),
      plot.caption = element_text(color = "#777777", vjust = 0),
      axis.title = element_text(size = rel(.9), hjust = 0.95, face = "italic"),
      panel.grid.major = element_line(size = rel(.1), color = "#000000"), 
      panel.grid.minor = element_line(size = rel(.05), color = "#000000"), 
      legend.position = "none")
}
my_labs <- function() {
  labs(title = "Average miles per gallon by number of cylinders", 
    subtitle = "Scatter plot depicting average miles per gallon aggregated by number of cylinders",
    x = "Number of cylinders", y = "Miles per gallon",
    caption = "Source: Estimates calculated from the 'mtcars' data set")
}
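my_save <- function(file) {
  ## ggsave() saves the most recently created plot; newer ggplot2 releases
  ## return the file name, so return NULL to keep `+ my_save()` harmless
  ggsave(file, width = 7, height = 4.5, units = "in")
  invisible(NULL)
}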

## print first six rows
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Mapping incorrect names

When visualizing data with ggplot2, one of the first and most important steps entails mapping observed variables in the data set to the aesthetic dimensions of a plot. But aesthetic mapping will only work as expected when you provide the correct names via ggplot2::aes().

The following section describes three common mistakes users make that result in the mapping of incorrect names.

1. Assuming a statistic inherits the name of a variable.

A common mistake is to assume that summarizing via mean() or median() results in a variable with the same name. For example, if we summarize the mean of mpg like we did above, i.e., summarize(mean(mpg)), and then try to map y = mpg, we get an error because “mpg” doesn’t exist.

## this gets an error because there is no variable named "mpg"
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  ggplot(aes(x = cyl, y = mpg)) + 
  geom_point() + 
  geom_line()
#> Error: Aesthetics must be either length 1 or the same as the data (3): x, y

We know from the summarize() output above that the variable’s name is actually mean(mpg). As this example illustrates, it is incorrect to assume that summarized estimates inherit the name of the variable they summarize. As noted earlier, this makes sense once you consider summaries calculated from two or more variables.
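The error above complains about aesthetic lengths rather than a missing object. A likely explanation (an inference from the error text, not something stated in the post) is that {ggplot2} ships with its own example data set named mpg, so when the summarized data has no mpg column, R finds that data set instead. A minimal sketch, assuming {ggplot2} is installed:

```r
library(ggplot2)

## with ggplot2 attached, `mpg` resolves to a built-in data set
is.data.frame(mpg)
#> [1] TRUE

## it has 234 rows and 11 columns, neither of which matches the 3 summarized rows
dim(mpg)
#> [1] 234  11
```

In a session where no object named mpg is visible, the same mistake would presumably surface as an "object not found" error instead.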

2. Repeating the expression used in `summarize()`.

A second common mistake is to assume that you can simply repeat the expression used in summarize() when specifying aesthetic mappings.

## this also doesn't work because it tries to calculate the mean of mpg
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  ggplot(aes(x = cyl, y = mean(mpg))) + 
  geom_point() + 
  geom_line() + 
  my_save("img/empty-plot.png")
#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA

#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA

#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 3 rows containing missing values (geom_path).
#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA

#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA

#> Warning in mean.default(mpg): argument is not numeric or logical: returning
#> NA
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 3 rows containing missing values (geom_path).

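The result is a handful of warnings and an empty plot. The above code fails because it tries to calculate the mean of mpg, which, again, doesn’t exist in the summarized data. (The warnings suggest R instead found some other, non-numeric object named mpg, most likely the mpg data set that ships with {ggplot2}.)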

3. Passing the expression as a quoted string.

The third common mistake is to treat the summarized expression name as a string.

## if we put quotes around it, it assumes it's a string
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  ggplot(aes(x = cyl, y = "mean(mpg)")) + 
  geom_point() + 
  geom_line() + 
  my_save("img/static-y.png")

This time we get a plot and no warnings, but it’s clearly not right. It shows every y value is exactly the same, but it seems far-fetched to think the average miles per gallon would not vary with number of cylinders.

In this case, the literal string "mean(mpg)" is mapped to the y variable value, which means it’s converted to a factor and the single factor level is coded as 1 at each observation.
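One way to check this is to inspect the data {ggplot2} actually computes for the layer. The sketch below is illustrative: it builds a small stand-in for the summarized data by hand (the object names summarized and p are assumptions) and uses layer_data() to pull out the plotted y values.

```r
library(ggplot2)

## hand-built stand-in for the summarized data
summarized <- data.frame(cyl = c(4, 6, 8), avg = c(26.7, 19.7, 15.1))

## map y to a quoted string, as in the mistake above
p <- ggplot(summarized, aes(x = cyl, y = "mean(mpg)")) +
  geom_point()

## every point lands on the same discrete position
as.numeric(unique(layer_data(p)$y))
#> [1] 1
```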

Solution: use tick marks

At this point it should be clear the name of the summarized mpg variable is actually “mean(mpg),” only now we also know wrapping the expression with quotes doesn’t work because it assumes the expression is a literal string, not a variable name.

The solution to correctly mapping unnamed summarize() variables is to use tick marks (also called backticks), the apostrophe-like symbol at the top-left of most keyboards. Tick marks work a lot like quotes insofar as they open and close and wrap all elements into a single object. The difference is that tick marks assume the marked object references a symbol. To illustrate, the code below assigns 10 random numbers to x and then prints it using both ticks and quotes.

## assign 10 random numbers to x
x <- rnorm(10)

## print x wrapped in quotes
"x"
#> [1] "x"

## print x wrapped in tick marks
`x`
#>  [1] -0.4614799  0.9832479 -1.7872899  0.2977996  0.1209820  1.3454420
#>  [7] -0.6433342  0.4772910  1.8410117  0.0823669

So, really, tick marks are used to distinguish symbols that contain one or more unfriendly punctuation marks or characters, e.g., parentheses, dashes, spaces, etc.
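As a quick illustration in base R, the sketch below creates a column and a variable whose names contain spaces and then refers to both with tick marks. The names used here are made up for the example.

```r
## check.names = FALSE keeps the space-containing column name as is
df <- data.frame(`miles per gallon` = c(21, 22.8), check.names = FALSE)
names(df)
#> [1] "miles per gallon"

## tick marks let us reference the column by its unfriendly name
df$`miles per gallon`
#> [1] 21.0 22.8

## the same goes for ordinary objects
`my var` <- 1:3
sum(`my var`)
#> [1] 6
```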

With this knowledge, we can now fix the featured summarize() example by wrapping the summarized expression, which functions as the name of the summarized variable, in tick marks.

## if we wrap it in tick marks, aes() treats it as a variable name
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  ggplot(aes(x = cyl, y = `mean(mpg)`)) + 
  geom_point() + 
  geom_line() + 
  my_save("img/tick-marks.png")

Of course, most audiences don’t really want to see expression text on a plot, so we can improve this plot by adding some better labels and a custom theme via the previously defined my_theme() and my_labs() functions.

## use tick marks instead of quotes to indicate variable name
mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  ggplot(aes(x = cyl, y = `mean(mpg)`)) + 
  geom_point() + 
  geom_line() + 
  my_theme() + 
  my_labs() + 
  my_save("img/with-labs.png")
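Tick marks also make it possible to give the column a friendlier name after the fact, which can be handy when the summarized data is reused in several plots. A minimal sketch, assuming {dplyr} and {ggplot2} are installed; the names summarized and avg_mpg are illustrative.

```r
library(dplyr)
library(ggplot2)

## reference the expression-named column with tick marks, then rename it
summarized <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean(mpg)) %>%
  rename(avg_mpg = `mean(mpg)`)

## the new name can be mapped without tick marks
p <- ggplot(summarized, aes(x = cyl, y = avg_mpg)) +
  geom_point() +
  geom_line()
```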

Notes

¹ The s and z toward the end of summarise() and summarize() are interchangeable.
