Collapse Cheat Sheet
Introduction                                                               Fast Statistical Functions                                                               Grouping and Ordering                                                                       Fast Data Manipulation
collapse is a C/C++ based package supporting advanced                      Fast functions to perform column–wise grouped and                                        Optimized functions for grouping, ordering, unique                                          Minimal overhead implementations
(grouped, weighted, time series, panel data and recursive)                 weighted computations on matrix-like objects                                             values, splitting & recombining, and dealing with factors
                                                                                                                                                                                                                                                                fselect[<-]() - select/replace columns
statistical operations in R, with very efficient low-level
vectorizations across both groups and columns.                                  fmean, fmedian, fmode, fsum, fprod, fsd, fvar                                       GRP() - create a grouping object (class ’GRP’): pass to g arg.                              fsubset() - subset data (rows and columns)
                                                                                fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct                                  g <- GRP(iris, ~ Species) # or GRP(iris$Species) or GRP(iris["Species"])
It also offers a flexible, class-agnostic approach to data                                                                                                          fndistinct(iris[1:4], g) # Computation without grouping overhead                            ss() - fast alternative to [, particularly for data frames
transformation in R: handling matrix and data frame based                  Syntax                                                                                   ##            Sepal.Length Sepal.Width Petal.Length Petal.Width                             [row|col]order[v]() - reorder (sort) rows and columns
objects in a uniform, attribute-preserving way, and ensuring                                                                                                        ## setosa               15          16            9           6
                                                                                                                                                                    ## versicolor           21          14           19           9                             fmutate(), fsummarise() - dplyr -like, incl. across() feature
seamless compatibility with dplyr / (grouped) tibble, data.table,               FUN(x, g = NULL, [w = NULL], TRA = NULL,                                            ## virginica            21          13           20          12
xts, sf and plm classes for panel data (’pseries’, ’pdata.frame’).                  [na.rm = TRUE], use.g.names = TRUE,                                                                                                                                         [f|set]transform[v][<-]() - transform cols (by reference)
                                                                                                                                                                    fgroup_by() - attach ’GRP’ object to data: a class-agnostic
collapse provides full control to the user for statistical                          [drop = TRUE], [nthreads = 1L])
                                                                                                                                                                                  grouped frame supporting fast computations                                    fcompute[v]() - compute new cols dropping existing ones
programming - with several ways to reach the same outcome                                                                                                           mtcars |> fgroup_by(cyl, vs, am) |> ss(1:2)
and rich optimization possibilities. Its default is na.rm = TRUE,                  x vector, matrix, or (grouped) data frame / list                                                                                                                             [f|set]rename() - rename (any object with ’names’ attribute)
                                                                                                                                                                    ##               mpg cyl disp hp drat   wt qsec vs am gear carb
which is implemented at very low cost at the algorithm level.                      g [optional] (list of) vectors / factors or GRP() object                         ## Mazda RX4      21   6 160 110 3.9 2.620 16.46 0 1     4    4                             [set]relabel() - assign/change variable labels (’label’ attr.)
                                                                                                                                                                    ## Mazda RX4 Wag 21    6 160 110 3.9 2.875 17.02 0 1     4    4
Calling help("collapse-documentation") brings up a                                 w [optional] vector of (frequency) weights                                       ##                                                                                          get_vars[<-]() - select/replace columns (standard eval.)
                                                                                                                                                                    ## Grouped by: cyl, vs, am [7 | 5 (3.8) 1-12]
detailed documentation, which is also available online. See                      TRA [optional] operation to transform data with computed                                                                                                                       [num|cat|char|fact|logi|date]_vars[<-]() - select/
                                                                                                                                                                    # Group Stats: [N. groups | mean (sd) min-max of group sizes]
also the fastverse package/project for a recommended set of                          statistics (see FUN argument to TRA() and Examples)                            # Fast Functions also have a grouped_df method: here wt-weighted medians                        replace columns by data type or retrieve names/indices
complementary packages and easy package management.                                                                                                                 mtcars |> fgroup_by(cyl, vs, am) |> fmedian(wt) |> head(3)
                                                                                drop drop matrix / data frame dimensions. default TRUE                                                                                                                          add_vars[<-]() - add or column-bind columns
                                                                                                                                                                    ##   cyl vs am sum.wt  mpg  disp hp drat  qsec gear carb
                                                                                                                                                                    ## 1   4  0  1  2.140 26.0 120.3 91 4.43 16.70    5    2
                                                                                                                                                                    ## 2   4  1  0  8.805 22.8 140.8 95 3.70 20.01    4    2
                                                                                                                                                                    ## 3   4  1  1 14.198 30.4  79.0 66 4.08 18.61    4    1
                                                                           Examples                                                                                                                                                                             Examples
Row/Column Arithmetic (by Reference)                                       fmean(AirPassengers)    # Vector
                                                                                                                                                                                                                                                                mtcars |> fsubset(mpg > fnth(mpg, 0.95), disp:wt, cylinders = cyl)
Column-wise sweeping out of vectors/matrices/DFs/lists                     ## [1] 280.2986                                                                          GRPN(), fgroup_vars(), fungroup() - get group count,                                        ##                disp hp drat    wt cylinders
                                                                           fmean(AirPassengers, w = cycle(AirPassengers))        # Weighted mean                        grouping columns/variables, and ungroup data                                            ## Fiat 128       78.7 66 4.08 2.200         4
   %cr%, %c+%, %c-%, %c*%, %c/% e.g. Z = X %c/% rowSums(X)                 ## [1] 284.3397
                                                                                                                                                                                                                                                                ## Toyota Corolla 71.1 65 4.22 1.835         4
## Mazda RX4       6 2.620 0.9691687 0.386125                               ## Indexed   by:     iso3c [1] | year [2 (61)]                                                                                                                                          ## [1] "var1" "var2" "var3"
## Mazda RX4 Wag   6 2.875 0.9691687 0.386125
                                                                                                                                                                        varying() - check variation within groups (panel-ids)                                       .c(values, vectors) %=% eigen(cov(mtcars)) # Multiple Assignment
                                                                            # Index stats: [N. ids] | [N. periods (tot.N. periods: (max-min)/GCD)]
# Much shorter than fsubset(mpg > fmean(mpg, cyl, TRA = "replace"))         LIFEEXi = wldi$LIFEEX # Indexed series                                                      pwcor(), pwcov(), pwnobs() - pairwise correlations,                                         # Variable labels: vlabels[<-], [set]relabel() etc. namlab() shows summary
                                                                            str(LIFEEXi, strict.width = "cut")                                                                                                                                                      namlab(wlddev[c(2, 9)], N = TRUE, Ndist = TRUE, class = TRUE)
mtcars |> fsubset(mpg > B(mpg, cyl)) |> head(2)                                                                                                                             covariance and obs. (with P-value and pretty printing)
##               mpg cyl disp hp drat   wt qsec vs am gear carb             ##   'indexed_series' num [1:13176] 32.4 33 33.5 34 34.5 ...                                                                                                                            ##   Variable   Class    N Ndist                             Label
## Mazda RX4      21   6 160 110 3.9 2.620 16.46 0 1     4    4             ##   - attr(*, "index_df")=Classes 'index_df', 'pindex' and 'data.frame'..                                                                                                              ## 1    iso3c factor 13176   216                      Country Code
                                                                            ##    ..$ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 ..                                                                                                              ## 2    PCGDP numeric 9470 9470 GDP per capita (constant 2010 US$)
## Mazda RX4 Wag 21    6 160 110 3.9 2.875 17.02 0 1     4    4
# Regression with cyl fixed effects - a la Mundlak (1978)
                                                                            ##    ..$ year : Ord.factor w/ 61 levels "1960"<"1961"<..: 1 2 3 4 5 6 7..                  List Processing
lm(mpg ~ carb + B(carb, cyl), data = mtcars) |> coef()                      LIFEEXi[1:7] # Subsetting indexed series                                                    Functions to process (nested) lists (of data objects)
##  (Intercept)         carb B(carb, cyl)
##    34.829652    -0.465511    -4.775032
                                                                            ## [1] 32.446 32.962 33.471 33.971 34.463 34.948 35.430
                                                                            ##                                                                                          ldepth() - level of nesting of list                                                         API Extensions
# Fast grouped (vs) bivariate regression slopes: mpg ~ carb
                                                                            ## Indexed by: iso3c [1] | year [7 (61)]
                                                                                                                                                                        is unlistable() - is list composed of atomic objects                                        Shorthands for frequently used functions
mtcars |> fgroup_by(vs) |> fmutate(dm_carb = W(carb)) |>                    c(is_irregular(LIFEEXi), is_irregular(LIFEEXi[-5])) # Is irregular?
  fsummarise(beta = fsum(mpg, dm_carb) %/=% fsum(dm_carb^2))                                                                                                            has_elem() - search if list contains certain elements                                       fselect -> slt, fsubset -> sbt, fmutate -> mtt,
                                                                            ## [1] FALSE       TRUE
##   vs      beta
                                                                                                                                                                                                                                                                    [f|set]transform[v] -> [set]tfm[v], fsummarise ->
## 1 0 -0.5557241                                                           Note: ’indexed_series’ and frames are supported via existing                                get_elem() - pull out elements from list / subset list                                      smr, across -> acr, fgroup_by -> gby, finteraction
## 2 1 -2.0706468                                                           ’pseries’/’pdata.frame’ methods for time series/panel functions.                            atomic_elem[<-](), list_elem[<-]() - get list with atomic /                                 -> itn, findex_by -> iby, findex -> ix, frename ->
# Residuals from regressing on 'Petal' vars and 'Species' FE                                                                                                                sub-list elements, examining only first level of list                                   rnm, get_vars -> gv, num_vars -> nv, add_vars -> av
fhdwithin(iris[1:2], iris[3:5]) |> head(2)                                  Fast functions to perform time-based computations on
##   Sepal.Length Sepal.Width                                                                                                                                           reg_elem(), irreg_elem() - get full list tree leading to atomic
                                                                            (irregular) time series and (unbalanced) panel data                                                                                                                                     Namespace masking
## 1   0.14989286   0.1102684                                                                                                                                               (’regular’) or non-atomic (’irregular’) elements
## 2 -0.05010714 -0.3897316                                                                                                                                                                                                                                         Can set option(collpse mask = c(...)) with a vector of
# Detrending with country-level cubic polynomials                           Lags/Leads, Differences, Growth Rates and Cumulative Sums                                   rsplit() - efficient (recursive) splitting
                                                                                                                                                                                                                                                                    functions starting with f-, to export versions without f-, masking
HDW(wlddev, PCGDP + LIFEEX + POP ~ iso3c * poly(year, 3)) |> head(2)        flag(x, n = 1, g = NULL, t = NULL, fill = NA, ...)                                          t_list() - efficient list transpose (transpose lists of lists)                              base R or dplyr. A few keywords exist to mask multiple
##    HDW.PCGDP HDW.LIFEEX  HDW.POP                                         fdiff(x, n = 1, diff = 1, g = NULL, t = NULL,
## 43 -258.4069 0.2360285 -317459.1                                                                                                                                     rapply2d() - recursive apply to lists of data objects                                       functions, see help("collapse-options"). This allows clean
                                                                                  fill = NA, log = FALSE, rho = 1, ...)
## 44 -119.5600 0.1136432 -33900.2                                                                                                                                                                                                                                  & fast code, but poses additional namespace challenges:
                                                                            fgrowth(x, n = 1, diff = 1, g = NULL, t = NULL, fill                                        unlist2d() - recursive row-binding to data.frame
# Note: HD centering/prediction and polynomials requires package 'fixest'                                                                                                                                                                                           # Masking all f- functions and specials n = GRPN and table = qtab
                                                                             = NA, logdiff = FALSE, scale = 100, power = 1, ...)                                                                                                                                    options(collapse_mask = "all")
                                                                            fcumsum(x, g = NULL, o = NULL, na.rm = TRUE,                                                Example: Nested Linear Models                                                               library(collapse)
                                                                                    fill = FALSE, check.o = TRUE, ...)                                                  (dl <- mtcars |> rsplit(mpg + hp + carb ~ vs + am)) |> str(max.level = 2)                   # The following is 100% collapse code, apart from the base pipe
Linear Models                                                                                                                                                           ## List of 2
                                                                                                                                                                                                                                                                    wlddev |>
                                                                            Statistical Operators: L(), F(), D(), Dlog(), G()                                           ## $ 0:List of 2
                                                                                                                                                                        ##   ..$ 0:'data.frame':       12 obs. of 3 variables:                                        subset(year >= 1990) |>
Fast (barebones) linear model fitting with 6 different solvers                                                                                                                                                                                                        group_by(year) |>
                                                                                                                                                                        ##   ..$ 1:'data.frame':       6 obs. of 3 variables:
flm(y, X, w = NULL, add.icpt = FALSE, method = "lm")                        Example: Computing Growth Rates                                                             ## $ 1:List of 2                                                                              summarise(n = n(), across(PCGDP:GINI, mean, w = POP))
                                                                                                                                                                        ##   ..$ 0:'data.frame':       7 obs. of   3 variables:
Fast R²-based F-test of exclusion restrictions for lm’s (with FE)           # Ad-hoc use: note that G() supports formulas which fgrowth() doesn't
                                                                                                                                                                        ##   ..$ 1:'data.frame':       7 obs. of   3 variables:                                     with(mtcars, table(cyl, vs, am))
                                                                            fgrowth(AirPassengers) |> head()
fFtest(y, exc, X = NULL, w = NULL, full.df = TRUE)                                                                                                                      nest_lm <- dl |> rapply2d(lm, formula = mpg ~ .)
                                                                                                                                                                                                                                                                    sum(mtcars)
                                                                            ## [1]         NA      5.357143 11.864407 -2.272727 -6.201550 11.570248                                                                                                                 diff(EuStockMarkets)
                                                                                                                                                                        (nest_coef <- nest_lm |> rapply2d(summary, classes = "lm") |>                               droplevels(wlddev)
Both functions also have formula interfaces:                                G(wlddev, c(1, 10), by = PCGDP ~ iso3c, t = ~ year) |> ss(11:12)                                 get_elem("coefficients")) |> str(give.attr = FALSE, strict = "cut")                    mean(nv(iris), g = iris$Species)
flm(cbind(mpg, disp) ~ hp + carb, weights = wt, mtcars)                     ##   iso3c year G1.PCGDP L10G1.PCGDP                                                        ## List of 2                                                                                scale(nv(GGDC10S), g = GGDC10S$Variable)
                                                                            ## 1   AFG 1970       NA          NA                                                        ## $ 0:List of      2                                                                       unique(GGDC10S, cols = c("Variable", "Country"))
##                     mpg       disp
                                                                            ## 2   AFG 1971       NA          NA                                                        ##   ..$ 0: num     [1:3,   1:4] 15.8791 0.0683 -4.5715 3.655 0.0345 ...                    range(wlddev$date)
## (Intercept) 28.48401839 42.155002
## hp          -0.06834996   2.101036                                       wlddev |> fgroup_by(iso3c) |> fselect(iso3c, year, PCGDP, LIFEEX) |>                        ##   ..$ 1: num     [1:3,   1:4] 26.9556 -0.0319 -0.308 2.293 0.0149 ...
## carb         0.33207257 -38.183910                                         fmutate(PCGDP_growth = fgrowth(PCGDP, t = year)) |> head(2)                               ## $ 1:List of      2                                                                       wlddev |>
                                                                            ##   iso3c year PCGDP LIFEEX PCGDP_growth                                                   ##   ..$ 0: num     [1:3,   1:4] 30.896903 -0.099403 -0.000332 3.346033 0.035..               index_by(iso3c, year) |>
# Test the exclusion of cyl-dummies and hp.
                                                                            ## 1   AFG 1960    NA 32.446           NA                                                   ##   ..$ 1: num     [1:3,   1:4] 37.0012 -0.1155 0.4762 7.3316 0.0894 ...                     mutate(PCGDP_lag = lag(PCGDP),
fFtest(mpg ~ qF(cyl) + hp | carb + qF(am), weights = wt, mtcars)
                                                                            ## 2   AFG 1961    NA 32.962           NA                                                   nest_coef |> unlist2d(c("vs", "am"), row.names = "variable") |> head(2)                              PCGDP_diff = PCGDP - PCGDP_lag,
##                     R-Sq. DF1 DF2 F-Stat. P-Value                                                                                                                                                                                                                         PCGDP_growth = growth(PCGDP)) |> unindex()
## Full Model          0.812   5 26 22.479     0.000                        settransform(wlddev, PCGDP_growth = G(PCGDP, g = iso3c, t = year))                          ##   vs am  variable    Estimate Std. Error t value     Pr(>|t|)
## Restricted Model    0.674   2 29 30.041     0.000                        # Note: can omit t -> requires consecutive observations and groups                          ## 1 0 0 (Intercept) 15.87914500 3.65495315 4.344555 0.001865018                            The best way to set this option is inside an .Rprofile file
## Exclusion Rest.     0.138   3 26    6.351   0.002                        # Usage with indexed series / frames:                                                       ## 2 0 0          hp 0.06832467 0.03449076 1.980956 0.078938069
                                                                                                                                                                                                                                                                    placed in the user or project directory. Use it carefully.
CC-BY-SA Sebastian Krantz • Learn more at sebkrantz.github.io/collapse • Source code at github.com/SebKrantz/collapse • Updates announced at twitter.com/collapse_R - #rcollapse • Cheatsheet created for collapse version 1.8.8 • Updated: 2022-08