Introduction
The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great-looking plots and impressive flexibility have made it popular amongst the R coding community. The syntax is rather different from that of other R graphics packages, allowing users to produce very creative plots with relatively small amounts of code.
In insurance, pricing models can become very complex and sometimes it is useful to have a tool like R to build graphs that are informative of the data structure. These can be useful not only in discussions within pricing teams but also when communicating ideas to non-technical people. Very often presenting the correct graph can save time.
The purpose of this post is to outline some exploratory plots that a pricing analyst might use when looking at data.
The Data
The data resides in the faraway R package and is called motorins. It contains claims data (Payment and perd), exposure data (Insured), number of claims (Claims) and some rating factors (Kilometres, Zone, Bonus, Make). We first load the packages we need.
require(faraway) # the data source
require(ggplot2) # for plotting
require(gridExtra) # for arranging plots
require(scales) # for the plot scales
We can look at the data table:
head(motorins)
  Kilometres Zone Bonus Make Insured Claims Payment     perd
1          1    1     1    1  455.13    108  392491 3634.176
2          1    1     1    2   69.17     19   46221 2432.684
3          1    1     1    3   72.88     13   15694 1207.231
4          1    1     1    4 1292.39    124  422201 3404.847
5          1    1     1    5  191.01     40  119373 2984.325
6          1    1     1    6  477.66     57  170913 2998.474
Note that perd is the payment per claim (Payment/Claims). We calculate the exposure-weighted claims frequency (AveCount) and rename perd to AvePaid.
claims <- motorins
names(claims)[8] <- "AvePaid"
claims$AveCount <- with(claims, Claims/Insured)
claims$Bonus <- factor(claims$Bonus, ordered = TRUE)
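As a quick sanity check on these definitions, the sketch below rebuilds the six rows printed by head(motorins) above in base R and confirms that perd matches Payment/Claims up to the printed rounding. The names first_rows, severity and max_gap are illustrative only and are not part of the original code.

```r
# The six rows printed by head(motorins) earlier in this post
first_rows <- data.frame(
  Insured = c(455.13, 69.17, 72.88, 1292.39, 191.01, 477.66),
  Claims  = c(108, 19, 13, 124, 40, 57),
  Payment = c(392491, 46221, 15694, 422201, 119373, 170913),
  perd    = c(3634.176, 2432.684, 1207.231, 3404.847, 2984.325, 2998.474)
)

# Severity: payment per claim, which should reproduce perd
severity <- with(first_rows, Payment / Claims)
max_gap  <- max(abs(severity - first_rows$perd))

# Frequency: number of claims per unit of exposure
first_rows$AveCount <- with(first_rows, Claims / Insured)

print(max_gap)
print(first_rows)
```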
More information on the data can be obtained by using ?motorins in the R interpreter.
Box plots for frequency and severity
We can start by looking at one-way box plots for frequency and severity. First we look at Kilometres, a categorical variable: an ordered factor for the distance driven each year. The first part of the code uses the qplot() function to create a box plot (geom = "boxplot") for the frequency. The second part repeats the task for the severity, and the last part simply arranges the two plots side by side in a 2-column layout. We also use the theme() function to position the legend.
# Frequency box-plot
fp <- qplot(data = claims, x = Kilometres, y = AveCount, fill = Kilometres,
geom = "boxplot", ylab = "Average Claims Count\n") + theme(legend.position="bottom")
# Severity box-plot
sp <- qplot(data = claims, x = Kilometres, y = AvePaid, fill = Kilometres,
geom = "boxplot", ylab = "Average Severity\n") + theme(legend.position="bottom")
# Arranging the plots
grid.arrange(fp, sp, ncol = 2)
Clearly some transformation is necessary, so we plot on a log y-scale:
# Scale transformation
fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
grid.arrange(fp, sp, ncol = 2)
This is better. The trend in the data on the frequency side becomes clear.
We would want to see the mean as well. Here we use the geom_point() function to add the mean point to each box plot. Please note the use of the I() function. In qplot(), arguments such as size and colour are treated as aesthetic mappings, so I() is needed to say that you really do mean the literal value you enter in the brackets. Inside geom_point(), constants set outside aes() are already taken literally, so I() is optional here.
# Adding the mean point to the box plot
fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
grid.arrange(fp, sp, ncol = 2)
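One point worth keeping in mind when reading the mean points: as far as I understand ggplot2's documented behaviour, scale transformations such as scale_y_log10() are applied before statistics are computed, so a summary mean on a log scale is a mean of log10 values, which corresponds to the geometric mean rather than the arithmetic mean. The base R sketch below illustrates the difference using the AveCount values of the six rows shown by head(motorins); the variable names are illustrative only.

```r
# Claims and exposure for the six rows printed by head(motorins)
claims_n  <- c(108, 19, 13, 124, 40, 57)
insured   <- c(455.13, 69.17, 72.88, 1292.39, 191.01, 477.66)
ave_count <- claims_n / insured

# Ordinary mean on the original scale
arithmetic_mean <- mean(ave_count)

# Mean taken on the log10 scale and back-transformed: the geometric mean
geometric_mean <- 10^mean(log10(ave_count))

print(c(arithmetic = arithmetic_mean, geometric = geometric_mean))
```

The geometric mean is never larger than the arithmetic mean, so the marked point can sit lower than an arithmetic mean calculated directly from the data.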
For convenience we can wrap this up as the R function below. In the function, we use the substitute(), deparse(), and eval() functions to mediate passing the rating factor that we are interested in into the qplot() function.
frSevBoxPlot <- function(rFactor = Kilometres, Data = claims){
rFactor <- substitute(rFactor)
fp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)),
y = AveCount, fill = eval(rFactor, list(fill = rFactor)),
geom = "boxplot", ylab = "Average Claims Count\n", xlab = deparse(rFactor)) +
theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), " "))
sp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)),
y = AvePaid, fill = eval(rFactor, list(fill = rFactor)),
geom = "boxplot", ylab = "Average Severity\n", xlab = deparse(rFactor)) +
theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), " "))
fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
grid.arrange(fp, sp, ncol = 2)
}
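The roles of substitute(), deparse() and eval() in frSevBoxPlot() may be easier to see in isolation. The following is a minimal base R sketch, not part of the original code: summarise_column() and the toy data frame are hypothetical names introduced purely for illustration.

```r
# Hypothetical helper: takes an unquoted column name and a data frame
summarise_column <- function(column, data) {
  expr  <- substitute(column)        # capture the argument unevaluated, e.g. the symbol Claims
  label <- deparse(expr)             # convert the symbol to the string "Claims"
  value <- mean(eval(expr, data))    # evaluate the symbol as a column of `data`
  list(label = label, value = value)
}

# Toy data: the first three rows shown by head(motorins)
toy <- data.frame(Claims  = c(108, 19, 13),
                  Insured = c(455.13, 69.17, 72.88))

res <- summarise_column(Claims, toy)
print(res$label)
print(res$value)
```

In frSevBoxPlot() the same pattern lets the caller write rFactor = Zone without quotes, while deparse() supplies the axis label and legend title.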
Now we can plot with impunity. The plots for the Zone, Bonus, and Make factors are all below.
frSevBoxPlot(rFactor = Zone)
frSevBoxPlot(rFactor = Bonus)
frSevBoxPlot(rFactor = Make)
From the above plots, claims frequency is far more interesting, so for the rest of this post we will focus on that.
Two-way plots
There are lots of ways to look at the influence of two factors on a variable in ggplot2. One of these is to use the facet_grid() function. In the plot below, we look at the influence of Zone and Bonus on the claims frequency. Both trends are immediately clear. We have shifted to using the ggplot() function, which is a more formal way of defining the plots. In this case the geometries that we want to plot are defined as standalone functions, e.g. geom_boxplot().
# `path` is assumed to hold the output directory, ending in a path separator; it is not defined in this post
svg(filename = paste(path, "7_Zone_Bonus_Two_Way.svg", sep = ""), width = 11, height = 7)
ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) + geom_boxplot() +
facet_grid(. ~ Bonus, labeller = label_both) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") +
geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
dev.off()
An alternative way of presenting this information is using the violin plot. The advantage here is that the shape of the distribution is immediately evident.
ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) + geom_violin(trim = FALSE) +
facet_grid(. ~ Bonus, labeller = label_both) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") +
geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
Histogram
Now we move to histograms, which in this case are particularly interesting because we will be plotting histograms of claim frequencies. On the face of it this can be slightly confusing (the idea of frequencies of frequencies), but the plots are still of value. First, we plot a basic histogram.
qplot(AveCount, data = claims, geom = "histogram", y = ..density..,
binwidth = .04, colour = I("white"), fill = I("orange"), xlab = "\nExposure weighted average Counts",
ylab = "Density", main = "Histogram of average claim counts\n")
For each rating factor, we can produce plots where each frequency bar is apportioned to rating categories. We can also choose to represent each bar as proportions. Eyeballing the charts side by side, we can see where the claims are and which categories are represented time and again. We do this by defining position = "stack", to stack the bars together, and position = "fill", to show the proportion of each category within each bar. We wrap our plotting code into a function
oneWayPlot <- function(rFactor = Kilometres, position = stack){
rFactor <- substitute(rFactor)
position <- deparse(substitute(position))
switch(position, fill = {ylab = "Proportions"; main = ylab}, {ylab = "Density"; main = "Histogram"})
ylab <- paste(ylab, "\n")
main <- paste(main, "of claim counts by", rFactor)
tplot <- qplot(AveCount, data = claims, geom = "histogram", y = ..density.., binwidth = .04,
fill = eval(rFactor, list(fill = rFactor)), ylab = ylab, xlab = "\nExposure weighted average Counts",
main = paste(main, "\n"), position = position) + scale_fill_discrete(name = paste(deparse(rFactor), " ")) +
theme(legend.position="bottom")
return(tplot)
}
and then output the plots
# Kilometers
grid.arrange(oneWayPlot(Kilometres, stack), oneWayPlot(Kilometres, fill), ncol = 2)
# Zone
grid.arrange(oneWayPlot(Zone, stack), oneWayPlot(Zone, fill), ncol = 2)
# Bonus
grid.arrange(oneWayPlot(Bonus, stack), oneWayPlot(Bonus, fill), ncol = 2)
# Make
grid.arrange(oneWayPlot(Make, stack), oneWayPlot(Make, fill), ncol = 2)
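To make the difference between position = "stack" and position = "fill" concrete, here is a small base R sketch of the underlying arithmetic. It uses counts rather than densities for simplicity, and the freq and group vectors are hypothetical toy values, not taken from motorins.

```r
# Hypothetical toy data: six claim frequencies in two rating categories
freq  <- c(0.02, 0.03, 0.05, 0.06, 0.07, 0.11)
group <- factor(c("a", "b", "a", "a", "b", "b"))

# Bin the frequencies with the same bin width (0.04) used in the histograms
bins <- cut(freq, breaks = c(0, 0.04, 0.08, 0.12))

# "stack": each bar's height is split into the counts per category
stacked <- table(group, bins)

# "fill": each bar is rescaled so that its categories sum to 1
filled <- prop.table(stacked, margin = 2)

print(stacked)
print(round(filled, 3))
```

The stacked table shows where the observations are, while the filled table shows which categories dominate each bin regardless of how many observations it holds.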
Summary
Using ggplot2 can clearly produce very interesting and informative graphics. There are few statistical programs that, like R, have such great potential to change the way that analysis is presented and carried out in the insurance industry. I hope that this post will encourage more actuarial analysts to have a go at R.