R Compendium
Data Structures in R

Vector
An ordered collection of data of the same type.
> a = c(1,2,3)
> a*2
[1] 2 4 6

Matrix
A rectangular table of data of the same type.

List
An ordered collection of data of arbitrary types.
> doe = list(name="john", age=28, married=FALSE)
> doe$name
[1] "john"
> doe$age
[1] 28

Data frame
A table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
  Policy_no premium State
      12345     100    IL
      25486     400    NY
      63254     350    FL

                 Linear    Rectangular
All Same Type    Vector    Matrix
Mixed            List      Data frame
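For reference, a minimal sketch constructing a matrix and the data frame shown above (the matrix values are made up; the data frame values are taken from the example):

m <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3 matrix, all elements numeric

a <- data.frame(Policy_no = c(12345, 25486, 63254),
                premium   = c(100, 400, 350),
                State     = c("IL", "NY", "FL"))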
Lapply, Sapply & Apply

Lapply
• When the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array.
• May be easier and faster than "for" loops.
• lapply(li, function)
• To each element of the list li, the function function is applied.
• The result is a list whose elements are the individual function results.

> li = list("klaus","martin")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"
[[2]]
[1] "MARTIN"

Sapply
sapply(li, fct)
Like lapply, but tries to simplify the result by converting it into a vector or array of appropriate size.

> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"

> fct = function(x) { return(c(x, x*x)) }
> sapply(1:5, fct)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    4    9   16   25
Apply
apply(arr, margin, fct)
Apply the function fct along some dimensions of the array arr, according to margin, and return a vector or array of the appropriate size.

> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5

> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
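The example matrix can be reproduced as follows; a minimal sketch, with the values taken from the printed output above:

x <- matrix(c(5, 7, 0,
              7, 9, 8,
              4, 6, 7,
              6, 3, 5), nrow = 4, byrow = TRUE)
apply(x, 1, sum)   # row sums
apply(x, 2, sum)   # column sums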
Regular expressions

Functions for detecting patterns
• grep(pattern, string, value = TRUE)
• grepl(pattern, string)

Functions for replacing patterns
• sub(pattern, replacement, string) - replace first match
• gsub(pattern, replacement, string) - replace all matches
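A minimal sketch of these functions in use (the example strings are made up):

states <- c("Florida", "Illinois", "New York")
grep("or", states, value = TRUE)   # "Florida" "New York"
grepl("or", states)                # TRUE FALSE TRUE
sub("o", "0", states)              # replaces the first "o" in each string
gsub("o", "0", states)             # replaces every "o" in each string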
Importing & Exporting data
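A minimal sketch of the data.table import/export functions used with the examples below (the file names are assumptions):

library(data.table)
dt <- fread("policies.csv")        # fast import into a data.table
fwrite(dt, "policies_out.csv")     # fast export to csv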
Understanding data table

DT[ i , j , by ]

i - which rows?
  - vector of row numbers
  - logical vector
  - another data table

j - what to do?
  - summarizing
  - updating variable(s)
  - adding variable(s)

by - grouped by what?
  - one or more columns
  - on-the-fly grouping variable(s)

• Order of execution: i is evaluated first, then by, then j
  DT[ i , j , by ]  ->  1 , 3 , 2
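A small made-up data table to run the examples in the sections that follow; the column names match the slides, the values are assumptions:

dt <- data.table(
  policy_no  = c(12345, 25486, 63254, 78901),
  premium    = c(100, 400, 350, NA),
  deductible = c(50, 250, 500, 150),
  state      = c("IL", "NY", "FL", "FL"),
  subln_grp  = c("PL", "CL", "PL", "CL"),
  segment    = c("retail", "retail", "commercial", "commercial")
)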
Filtering rows & selecting columns in data table

Operation: Subsetting rows by numbers
  # select first to tenth row
  dt[ 1:10 , ]

Operation: Selecting rows based on a condition on column values
  # selecting rows with subln_grp as "PL"
  dt[ subln_grp == "PL" , ]
  # selecting rows which don't have NA in the premium field
  dt[ !is.na(premium) , ]
  # select rows with state as Florida and premium greater than zero
  dt[ state == "FL" & premium > 0 , ]

Operation: Selecting columns
  # select multiple columns
  dt[ , .(policy_no , premium , state) ]
  Note - .() is an alias for list(). If .() is used, the returned value is a data.table; if .() is not used, the result is a vector.

Operation: Subsetting rows and selecting columns together
  # select only rows where segment is "retail" and the relevant columns
  dt[ segment == "retail" , .(policy_no , premium , state) ]
Summarizing data table
• Count and aggregation

Operation: Count
  # counting number of policies with PL as their subln_grp
  dt[ subln_grp == "PL" , .N ]

Operation: Count distinct
  # count unique number of segments
  dt[ , uniqueN(segment) ]
  uniqueN(dt , by = "segment")
  # count unique number of segments for Florida
  dt[ state == "FL" , uniqueN(segment) ]

Operation: Simple aggregation
  # taking average premium and count of policies
  dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ]

Operation: Aggregation including filtering
  # taking average premium and count of policies for which deductible is greater than $100
  dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ]
Summarizing data table
• Group by

Operation: Simple group by
  # taking average premium and count of policies by segment
  dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = segment ]

  # taking average premium and count of policies for which deductible is greater than $100, by segment
  dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = segment ]
  OR
  dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = .(segment , ded = deductible > 100) ]
Special symbols - .SD & .SDcols

.SD (Subset of Data)
- a data.table by itself
- holds the data of the current group as defined in by
- when there is no by, .SD refers to the whole data.table
- allows for calculations on multiple columns

.SDcols
- specifies the columns of the data table that are included in .SD
- allows a function to be applied to all rows, but only to selected columns
- comes in handy when there are many columns and a particular operation is to be performed only on a subset of them
Summarizing data table

Operation: .SD & .SDcols
  # computing average premium using .SD
  dt[ , lapply(.SD , function(x) mean(x , na.rm = TRUE)) , by = segment , .SDcols = "premium" ]
  OR
  dt[ , lapply(.SD , function(x) mean(x , na.rm = TRUE)) , by = segment , .SDcols = grep("prem" , names(dt)) ]
Update, add, delete variables in data table
Special operator: :=
Updates a data table in place (by reference)
Can be used to:
• update existing columns
• add new columns
• delete columns
No need to use <-
Operation: Updating variables
  # doubling the premium amount
  dt[ , premium := premium * 2 ]
  # doubling both premium and deductible amount
  dt[ , `:=` (premium = premium * 2 ,
              deductible = deductible * 2) ]

Operation: Updating variables by group
  # modifying premium by segment
  dt[ , premium := premium * uniqueN(deductible) / .N , by = segment ]
Update, add, delete variables in data table

Operation: Adding variables
  # adding a new premium variable with premium doubled
  dt[ , new_premium := premium * 2 ]

Operation: Adding variables by group
  # computing state premium
  dt[ , state_premium := sum(premium , na.rm = TRUE) , by = state ]
  # computing state_premium and state_deductible
  dt[ , `:=` (state_premium = sum(premium , na.rm = TRUE) ,
              state_deductible = max(deductible)) , by = state ]

Operation: Deleting variables
  # deleting the new_premium variable
  dt[ , new_premium := NULL ]
  # deleting the state variables
  dt[ , grep("state_" , names(dt)) := NULL ]
  # deleting the first two columns
  dt[ , (1:2) := NULL ]
Reshaping data table

From wide to long format - melt function
Syntax:
melt(data, id.vars, measure.vars,
     variable.name = "variable",
     value.name = "value")

Parameter: id.vars
  ID columns that identify multiple entries
Parameter: measure.vars
  Columns to be stacked into the long format (their values fill the value column)
Parameter: variable.name and value.name
  Names of the new columns for the variables and values derived from the old headers

From long to wide format - dcast function
Syntax:
dcast(dt, id ~ y, value.var = c("a", "b"))

Parameter: id ~ y
  Formula with a LHS (ID columns containing IDs for multiple entries) and a RHS (columns whose values are spread into the column headers)
Parameter: value.var
  Columns containing values to fill into the cells
Reshaping data table

Example - from wide format to long format and back:

Syntax:
wide_data <- as.data.table(fread("reshape_data.csv"))
long_data <- melt(wide_data,
                  id.vars = c("policy_no" , "state" , "subln_grp"),
                  measure.vars = c("written_premium", "endorsement_premium"),
                  variable.name = "premium_type" , value.name = "premium")
wide_data <- dcast(long_data , policy_no + state + subln_grp ~ premium_type , value.var = "premium")
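If "reshape_data.csv" is not available, a minimal wide_data sketch to run the melt()/dcast() calls above (the values are assumptions):

wide_data <- data.table(
  policy_no = c(12345, 25486),
  state = c("IL", "NY"),
  subln_grp = c("PL", "CL"),
  written_premium = c(100, 400),
  endorsement_premium = c(10, 25)
)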
Set() family in data table

Operation: Create or update column names by reference
  # change the name of the premium field to wrt_prem
  setnames(dt, "premium", "wrt_prem")
  # change the names of multiple columns
  setnames(dt, c("premium", "policy_no"), c("wrt_prem", "policy_id"))

Operation: Setting a key on a data table
  # set policy_no as the key
  setkey(dt, policy_no)

Operation: Reorder columns by reference
  # change the column order of the data table
  setcolorder(dt, c("deductible", "coverage", "segment", "state", "subln_grp", "policy_no", "premium"))
Binning variables in data table
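A minimal sketch of one common approach, binning premium into bands with cut(); the breaks, labels, and the premium_band column name are assumptions:

dt[ , premium_band := cut(premium,
                          breaks = c(0, 100, 250, 500, Inf),
                          labels = c("0-100", "100-250", "250-500", "500+")) ]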
Other operations in data table
Joins in data table
• Right join
Syntax:
right_join <- dt1[ dt2 , on = "policy_no" ]   # right join (dt1 is left, dt2 is right)
right_join <- merge(dt1 , dt2 , by = "policy_no" , all.y = TRUE)

dt1 captures written premium and state information; dt2 captures coverage and location code information.
right_join has all rows from dt2.
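Minimal sketches of dt1 and dt2 for the join examples; the structure follows the descriptions above, and the values and the loc_code column name are assumptions:

dt1 <- data.table(policy_no = c(12345, 25486, 63254),
                  written_premium = c(100, 400, 350),
                  state = c("IL", "NY", "FL"))
dt2 <- data.table(policy_no = c(25486, 63254, 99999),
                  coverage = c("A", "B", "C"),
                  loc_code = c(10, 20, 30))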
Joins in data table
• Left join
Syntax:
left_join <- dt2[ dt1, on = "policy_no"] # left join (dt1 is left, dt2 is right)
left_join <- merge(dt1 , dt2 , by = "policy_no" , all.x = TRUE)

dt1 captures written premium and state information; dt2 captures coverage and location code information.
left_join has all rows from dt1.
Joins in data table
• Inner join
Syntax:
inner_join <- dt1[ dt2, on = "policy_no", nomatch = 0] # inner join
inner_join <- merge(dt1, dt2, by = "policy_no")
dt1 captures written premium and state information; dt2 captures coverage and location code information.
inner_join has all rows where dt1's key column values match dt2's key column values.
Joins in data table
• Left anti join
Syntax:
left_anti_join <- dt1[ !dt2 , on = "policy_no" ]
• Right anti join
Syntax:
right_anti_join <- dt2[!dt1, on = "policy_no"]
• Full outer join
Syntax:
full_outer_join <- merge(dt1 , dt2 , by = "policy_no" , all = TRUE)
Joins in data table
• Rolling join
• Rolling joins are used for analyzing data involving time
• Let’s say data table 1 (dt1) contains policy information like inception date, end date , written
premium and location
• Data table 2 (dt2) has policy endorsement details like endorsement effective date and associated
premium amount
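Minimal sketches of dt1 and dt2 for the rolling join; the column names follow the code on the next slide, and the values are assumptions:

dt1 <- data.table(policy_no = c(12345, 25486),
                  policy_start_date = c("1/1/20", "3/1/20"),
                  policy_end_date = c("12/31/20", "2/28/21"),
                  written_premium = c(100, 400),
                  loc_code = c(10, 20))
dt2 <- data.table(policy_no = c(12345, 25486),
                  endorsement_eff_dt = c("6/15/20", "4/1/20"),
                  endorsement_premium = c(25, 40))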
Joins in data table
• Rolling join
• Rolling join can be used to merge both tables with the condition that endorsement effective date lies
within the policy period
Syntax:
# Converting character to date
dt1[ , `:=` (policy_start_date = as.Date(policy_start_date , "%m/%d/%y") ,
             policy_end_date = as.Date(policy_end_date , "%m/%d/%y")) ]
dt2[ , endorsement_eff_dt := as.Date(endorsement_eff_dt , "%m/%d/%y") ]

# Creating the date variable to be used for joining
dt1[ , join_date := policy_start_date ]
dt2[ , join_date := endorsement_eff_dt ]

# Setting keys on each table
setkey(dt1 , policy_no , join_date)
setkey(dt2 , policy_no , join_date)

# Rolling backward
rolling_join <- dt2[ dt1 , roll = -365 ]
ggplot2 - Grammar of Graphics plot
Data: in ggplot2, data must be stored as an R data frame
Coordinate system: describes 2-D space that data is projected onto - for example, Cartesian
coordinates, polar coordinates, map projections
Geoms: describe type of geometric objects that represent data - for example, points, lines, polygons
Aesthetics: describe visual characteristics that represent data - for example, position, size, color,
shape, transparency, fill
Scales: for each aesthetic, describe how visual characteristic is converted to display values - for
example, log scales, color scales, size scales, shape scales
Stats : describe statistical transformations that typically summarize data - for example, counts,
means, medians, regression lines
Facets: describe how data is split into subsets and displayed as multiple small graphs
Creating a plot object
creates a plot object that can be assigned to a variable
can specify data frame and aesthetic mappings (visual characteristics that represent data)
Syntax:
library(ggplot2)
library(data.table)
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium))
p
x‐axis position indicates subln_grp
y‐axis position indicates premium
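If "ggplot_data.csv" is not available, a minimal plot_data sketch to run the ggplot examples; the column names follow the slides, the values are assumptions:

set.seed(1)
plot_data <- data.table(
  subln_grp = rep(c("PL", "CL"), each = 50),
  state = sample(c("IL", "NY", "FL"), 100, replace = TRUE),
  premium = round(runif(100, 50, 500)),
  deductible = sample(c(50, 100, 250, 500), 100, replace = TRUE)
)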
Adding a layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_point(size = 2)
Layer
Purpose:
Display the data – allows viewer to see patterns, overall structure, local structure, outliers
Display statistical summaries of the data – allows viewer to see counts, means, medians, IQRs, model
predictions
Data and aesthetics (mappings) may be inherited from ggplot() object or added, changed, or dropped
within individual layers
Most layers contain a geom - the fundamental building block of ggplot2
full specification: geom_xxx(mapping, data, stat, position, ...)
Each geom_xxx() has a default stat (statistical transformation) associated with it, but the default statistical transformation can be changed using the stat parameter.
Adding a geom layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_blank()
p + geom_point()
p + geom_jitter()
p + geom_count()
Displaying Statistical Summary
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state))
p + geom_bar()
Already transformed data
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
transfrmd_data <- plot_data[ , count := .N , by = state]
transfrmd_data <- unique(transfrmd_data[,.(count , state)])
p <- ggplot(data = transfrmd_data , aes(x = state , y = count))
p + geom_col()
# or
p + geom_bar(stat = "identity")
geom_bar: the height of each bar is proportional to the number of observations in the group; it uses the count stat by default.
geom_col: leaves the data as is; it uses the identity stat.
Displaying distributions
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = premium))
p + geom_histogram()
p + geom_freqpoly()
p + geom_freqpoly(aes(color = state))
Displaying Statistical Summaries
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , y = premium))
p + geom_boxplot()
Position
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , fill = deductible > 100))
p + geom_bar() p + geom_bar(position="stack")
p + geom_bar(position="dodge") p + geom_bar(position="fill")