Exploratory analysis using R & ggplot2

Introduction

The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great looking plots and impressive flexibility have made it a popular amongst the R coding community. The syntax is rather different from other R graphics package allowing users to produce very creative plots with relatively small amounts of code.

In insurance, pricing models can become very complex and sometimes it is useful to have a tool like R to build graphs that are informative of the data structure. These can be useful not only in discussions within pricing teams but also when communicating ideas to non-technical people. Very often presenting the correct graph can save time.

The purpose of this post is to outline some exploratory plots that a pricing analyst might use when looking at data.

The Data

The data resides in the faraway R package and is called motorins. It contains claims data (Payment and perd), exposure data (Insured), number of claims (Claims) and some rating factors (Kilometres, Zone, Bonus, Make).

We first load the packges we need.

require(faraway) # the data source
require(ggplot2) # for plotting
require(gridExtra) # for arranging plots
require(scales) # for the plot scales

We can look at the data table:

head(motorins)
  Kilometres Zone Bonus Make Insured Claims Payment     perd
1          1    1     1    1  455.13    108  392491 3634.176
2          1    1     1    2   69.17     19   46221 2432.684
3          1    1     1    3   72.88     13   15694 1207.231
4          1    1     1    4 1292.39    124  422201 3404.847
5          1    1     1    5  191.01     40  119373 2984.325
6          1    1     1    6  477.66     57  170913 2998.474

Note that perd is the payment per claim (Payment/Claims).

We calculate the exposure weighted claims frequency (AveCount) and rename pred to AvePaid.

claims <- motorins
names(claims)[8] <- "AvePaid"
claims$AveCount <- with(claims, Claims/Insured)
claims$Bonus <- factor(claims$Bonus, ordered = TRUE)

More information on the data can be obtained by using ?motorins in the R interpreter.

Box plots for frequency and severity

We can start by looking at one-way box plots for frequency and severity. First we look at the Kilometres, categorical variable. It is an ordered factor for distance driven each year. The first part of the code uses the qplot() function to create a frequency boxplot(geom = "boxplot") for the frequency. The second part repeats the task for the severity, and the last part of the code simply arranges the plots that have been produced into a 2-column plot. We also use the theme() function to position the legend.

# Frequency box-plot
fp <- qplot(data = claims, x = Kilometres, y = AveCount, fill = Kilometres, 
 geom = "boxplot", ylab = "Average Claims Count\n") + theme(legend.position="bottom")

# Severity box-plot
sp <- qplot(data = claims, x = Kilometres, y = AvePaid, fill = Kilometres, 
 geom = "boxplot", ylab = "Average Severity\n") + theme(legend.position="bottom")

# Arranging the plots
grid.arrange(fp, sp, ncol = 2)

Clearly some transformation is necessary and so we plot on a log y-scale

# Scale transformation
fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),                                              labels = trans_format("log10", math_format(10^.x)))

sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
grid.arrange(fp, sp, ncol = 2)

This is better. The trend in the data on the frequency side becomes clear.

We would want to see the mean as well. Here we use the geom_point() function to add the mean point to each box plot. Please note the use of the I() function. While using ggplot2 it is sometimes necessary to use the I() function to denote that you really do mean what you enter in the brackets.

# Adding the mean point to the box plot
fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
grid.arrange(fp, sp, ncol = 2)

For convenience we can wrap this up as the R function below. In the factor below, we use the substitute(), deparse(), and eval()functions to mediate passing the rating factors that we are interested in into the qplot() function.

frSevBoxPlot <- function(rFactor = Kilometres, Data = claims){
  
  rFactor <- substitute(rFactor)
  
  fp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AveCount, fill = eval(rFactor, list(fill = rFactor)), 
              geom = "boxplot", ylab = "Average Claims Count\n", xlab = deparse(rFactor)) + 
              theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  sp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AvePaid, fill = eval(rFactor, list(fill = rFactor)),
              geom = "boxplot", ylab = "Average Severity\n", xlab = deparse(rFactor)) + 
    theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  
  grid.arrange(fp, sp, ncol = 2)
  
}

Now we can plot with impunity. The plots for Zone, Bonus, and Make factors are all below

frSevBoxPlot(rFactor = Zone)
frSevBoxPlot(rFactor = Bonus)
frSevBoxPlot(rFactor = Make)

From the above plots, claims frequency is far more interesting so for the rest of this blog, we will focus on that.

Two-way plots

There are lots of ways to look at the influence of two factors on a variable in ggplot2. One of these is to use the facet_grid() function. In the plot below, we look at the influence of Zone and Bonus on the claims frequency. Both trends are immediately clear. We have shifted to using the ggplot() function, which is a more formal way of defining the plots. In this case the geometry that we want to plot are defined as standalone functions, e.g. geom_boxplot().

svg(filename = paste(path, "7_Zone_Bonus_Two_Way.svg", sep = ""), width = 11, height = 7)
ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_boxplot() + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
dev.off()

An alternative way of presenting this information is using the violin plot. The advantage here is that the shape of the distribution is immediately evident.

ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_violin(trim = FALSE) + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

Histogram

Now we move to histograms which in this case are particularly interesting. This is because we will be plotting histograms of claim frequencies. On the face of it this can be slightly confusing (the idea of frequencies of frequencies) but the plots are also of value. Firstly, we plot a basic histogram.

qplot(AveCount, data = claims, geom = "histogram", y = ..density.., 
      binwidth = .04, colour = I("white"), fill = I("orange"), xlab = "\nExposure weighted average Counts", 
      ylab = "Density", main = "Histogram of average claim counts\n")

For each rating factor, we can produce plots where each frequency bar is apportioned to rating categories. Not only this we can also choose to represent each bar as proportions. Eyeballing the charts side by side, we can see where the claims are and which categories are represented time and again. We do this by defining position = "stack", to stack the bars together, and position = "fill" to represent each bar's proportion categories.

We rap our plotting code into a function

oneWayPlot <- function(rFactor = Kilometres, position = stack){
  
  rFactor <- substitute(rFactor)
  position <- deparse(substitute(position))
  
  switch(position, fill = {ylab = "Proportions"; main = ylab}, {ylab = "Denstiy"; main = "Histogram"})
  ylab <- paste(ylab, "\n")
  main <- paste(main, "of claim counts by", rFactor)
  
  tplot <- qplot(AveCount, data = claims, geom = "histogram", y = ..density.., binwidth = .04, 
                 fill = eval(rFactor, list(fill = rFactor)), ylab = ylab, xlab = "\nExposure weighted average Counts",
                 main = paste(main, "\n"), position = position) + scale_fill_discrete(name = paste(deparse(rFactor), "  ")) + 
    theme(legend.position="bottom")
  
  return(tplot)
}

and then output the plots

# Kilometers
grid.arrange(oneWayPlot(Kilometres, stack), oneWayPlot(Kilometres, fill), ncol = 2)
# Zone
grid.arrange(oneWayPlot(Zone, stack), oneWayPlot(Zone, fill), ncol = 2)
# Bonus
grid.arrange(oneWayPlot(Bonus, stack), oneWayPlot(Bonus, fill), ncol = 2)
# Make
grid.arrange(oneWayPlot(Make, stack), oneWayPlot(Make, fill), ncol = 2)

Summary

Using ggplot2 can clearly produce very interesting and informative graphics. There are few statistical programs like R that have a great potential to change the way that analysis is presented and carried out in the insurance industry. I hope that this post will encourage more actuarial analysts to have a go at R.

Data Analytics - Chasing the infinity

Tuesday, 27 September 2016