Data Analytics - Chasing the infinity : September 2016

Tuesday, 27 September 2016

Data Visualization in Python: Exploring Seaborn

Introduction:

Seaborn is a Python data visualization library with an emphasis on statistical plots. The library is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once.

In this post, we'll cover three of Seaborn's most useful functions: factorplot, pairplot, and jointplot. Going a step further, we'll show how to get even more mileage out of these functions by stepping up to their more powerful forms: FacetGrid, PairGrid, and JointGrid.

The Data

To showcase Seaborn, we'll use the UCI "Auto MPG" data set.

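We did a bit of preprocessing to get the data ready for analysis. The examples below assume that the cleaned data sits in a pandas DataFrame named df, and that Seaborn and matplotlib have been imported as sns (import seaborn as sns) and plt (import matplotlib.pyplot as plt).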
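
The preprocessing itself isn't shown in this post. As a rough sketch of what such a step might look like, the snippet below parses a few lines laid out like the raw UCI file, drops rows with a missing value, and relabels the origin codes. The column order, the "?" marker for missing values, and the 1/2/3 to USA/Europe/Japan mapping are assumptions about the raw file, and the sample lines are only illustrative.

```python
import shlex

# Assumed column order of the raw UCI "auto-mpg.data" file.
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "name"]

# Assumed meaning of the numeric origin codes.
ORIGIN_LABELS = {1: "USA", 2: "Europe", 3: "Japan"}

# Illustrative lines in the raw file's layout; "?" marks a missing value.
SAMPLE = '''\
18.0   8   307.0   130.0   3504.   12.0   70  1  "chevrolet chevelle malibu"
25.0   4   98.00   ?       2046.   19.0   71  1  "ford pinto"
24.0   4   113.0   95.00   2372.   15.0   70  3  "toyota corona mark ii"
'''


def parse_auto_mpg(text):
    """Return one dict per complete row, skipping rows with a missing value."""
    rows = []
    for line in text.splitlines():
        fields = shlex.split(line)  # keeps the quoted car name as one field
        if len(fields) != len(COLUMNS) or "?" in fields:
            continue
        row = dict(zip(COLUMNS, fields))
        for key in COLUMNS[:-1]:  # every column except the car name is numeric
            row[key] = float(row[key])
        row["model_year"] = int(row["model_year"])
        row["origin"] = ORIGIN_LABELS[int(row["origin"])]
        rows.append(row)
    return rows


rows = parse_auto_mpg(SAMPLE)
print(len(rows), rows[0]["origin"], rows[0]["horsepower"])
```

Passing a list of dicts like rows to pandas.DataFrame would give a table with the column names used in the plots that follow.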

`factorplot` and `FacetGrid`

One of the most powerful features of Seaborn is the ability to easily build conditional plots; this lets us see what the data look like when segmented by one or more variables. The easiest way to do this is through factorplot. Let's say that we're interested in how cars' MPG has varied over time. Not only can we easily see this in aggregate:

sns.factorplot(data=df, x="model_year", y="mpg")

But we can also segment by, say, region of origin:

sns.factorplot(data=df, x="model_year", y="mpg", col="origin")

What's so great about factorplot is that rather than having to segment the data ourselves and make the conditional plots individually, Seaborn provides a convenient API for doing it all at once.
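
To make the "segment and summarise" step concrete, here is a minimal sketch of the bookkeeping that a call like the one above saves us: split the rows by origin (one panel each), then average mpg for each model year. The records are hypothetical stand-ins for rows of df, and the confidence intervals that factorplot also draws are left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records standing in for rows of df.
records = [
    {"origin": "USA", "model_year": 70, "mpg": 18.0},
    {"origin": "USA", "model_year": 70, "mpg": 15.0},
    {"origin": "USA", "model_year": 71, "mpg": 25.0},
    {"origin": "Japan", "model_year": 70, "mpg": 24.0},
    {"origin": "Japan", "model_year": 71, "mpg": 27.0},
    {"origin": "Japan", "model_year": 71, "mpg": 31.0},
]


def conditional_means(rows, col, x, y):
    """Group rows by `col` (one panel each), then average `y` per value of `x`."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[col]][row[x]].append(row[y])
    return {
        panel: {x_value: mean(values) for x_value, values in sorted(by_x.items())}
        for panel, by_x in groups.items()
    }


panels = conditional_means(records, col="origin", x="model_year", y="mpg")
for origin, series in panels.items():
    print(origin, series)
```

Each key of panels corresponds to one column of the grid, and each inner dictionary holds the points plotted in that panel.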

The FacetGrid object is a slightly more complex, but also more powerful, take on the same idea. Let's say that we wanted to see the MPG distributions (histograms with overlaid KDE curves), separated by country of origin:

g = sns.FacetGrid(df, col="origin")  
g.map(sns.distplot, "mpg")

Or let's say that we wanted to see scatter plots of MPG against horsepower with the same origin segmentation:

g = sns.FacetGrid(df, col="origin")  
g.map(plt.scatter, "horsepower", "mpg")

Using FacetGrid, we can map any plotting function onto each segment of our data. For example, above we gave plt.scatter to g.map, which tells Seaborn to apply the matplotlib plt.scatter function to each of the segments in our data. We don't need to use plt.scatter, though; we can use any function that understands the input data. For example, we could draw regression plots instead:

g = sns.FacetGrid(df, col="origin")  
g.map(sns.regplot, "horsepower", "mpg")  
plt.xlim(0, 250)  
plt.ylim(0, 60)

We can even segment by multiple variables at once, spreading some along the rows and some along the columns. This is very useful for comparing conditional distributions across interacting segmentations:

df['tons'] = (df.weight/2000).astype(int)  
g = sns.FacetGrid(df, col="origin", row="tons")  
g.map(sns.kdeplot, "horsepower", "mpg")  
plt.xlim(0, 250)  
plt.ylim(0, 60)
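
The tons column above is just the weight divided by 2000 and truncated to a whole number, which assumes the weights are in pounds. A small sketch of that binning, with hypothetical cars, shows how rows end up in the (tons, origin) cells of the grid. It can be worth counting rows per cell like this before plotting, since sparsely populated cells give unreliable density estimates.

```python
from collections import Counter


def tons(weight_lb):
    """Whole tons of 2000 lb, truncated the same way astype(int) truncates."""
    return int(weight_lb / 2000)


# Hypothetical (origin, weight in pounds) pairs standing in for rows of df.
cars = [
    ("USA", 3504), ("USA", 4341), ("USA", 4354),
    ("Europe", 1835), ("Japan", 2372), ("Japan", 2130),
]

# One grid cell per (row, col) combination, as in row="tons", col="origin".
cells = Counter((tons(weight), origin) for origin, weight in cars)
for (ton_bin, origin), count in sorted(cells.items()):
    print(ton_bin, origin, count)
```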

`pairplot` and `PairGrid`

While factorplot and FacetGrid are for drawing conditional plots of segmented data, pairplot and PairGrid are for showing the interactions between variables. For our car data set, we know that MPG, horsepower, and weight are probably going to be related; we also know that both the values of these variables and their relationships with one another might vary by country of origin. Let's visualize all of that at once:

g = sns.pairplot(df[["mpg", "horsepower", "weight", "origin"]], hue="origin", diag_kind="hist")  
for ax in g.axes.flat:  
    plt.setp(ax.get_xticklabels(), rotation=45)

Just as FacetGrid is a fuller version of factorplot, PairGrid builds on the same idea as pairplot but gives a bit more freedom by letting you control the individual plot types separately. Let's say, for example, that we're building regression plots, and we'd like to see both the original data and the residuals at once. PairGrid makes it easy:

g = sns.PairGrid(df[["mpg", "horsepower", "weight", "origin"]], hue="origin")  
g.map_upper(sns.regplot)  
g.map_lower(sns.residplot)  
g.map_diag(plt.hist)  
for ax in g.axes.flat:  
    plt.setp(ax.get_xticklabels(), rotation=45)
g.add_legend()  
g.set(alpha=0.5)

We were able to control three regions (the diagonal, the lower-left triangle, and the upper-right triangle) separately. Again, you can pipe in any plotting function that understands the data it's given.
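
For readers who haven't met residual plots before, the sketch below shows the quantity that ends up on the y-axis of the lower-left panels: the gap between each observed value and a straight-line least-squares fit. The numbers are hypothetical, and this plain-Python fit only stands in for the regression that Seaborn performs internally.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Observed y minus the fitted line's prediction, one value per point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical horsepower / mpg pairs.
horsepower = [50.0, 100.0, 150.0, 200.0]
mpg = [35.0, 24.0, 17.0, 12.0]

slope, intercept = fit_line(horsepower, mpg)
res = residuals(horsepower, mpg)
print(slope, intercept, res)
```

Residuals that are positive at both ends and negative in the middle, as here, hint that a straight line is missing some curvature in the relationship.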

`jointplot` and `JointGrid`

The final Seaborn objects we'll talk about are jointplot and JointGrid; these features let you easily view both a joint distribution and its marginals at once. Let's say, for example, that aside from being interested in how MPG and horsepower are distributed individually, we're also interested in their joint distribution:

sns.jointplot(x="mpg", y="horsepower", data=df, kind='kde')
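
If the relationship between a joint distribution and its marginals feels abstract, a binned version makes it tangible. The sketch below swaps the smooth KDE for simple counts in rectangular bins: the joint table counts points per (mpg bin, horsepower bin), and each marginal is just that table summed along one direction. The points and bin widths are hypothetical.

```python
from collections import Counter


def bin_index(value, width):
    """Index of the fixed-width bin that contains value."""
    return int(value // width)


def joint_and_marginals(points, x_width, y_width):
    """Count points per 2-D bin, then sum the counts along each axis."""
    joint = Counter(
        (bin_index(x, x_width), bin_index(y, y_width)) for x, y in points
    )
    x_marginal = Counter()
    y_marginal = Counter()
    for (x_bin, y_bin), count in joint.items():
        x_marginal[x_bin] += count
        y_marginal[y_bin] += count
    return joint, x_marginal, y_marginal


# Hypothetical (mpg, horsepower) pairs.
points = [(18.0, 130.0), (15.0, 165.0), (24.0, 95.0),
          (27.0, 88.0), (31.0, 65.0), (16.0, 150.0)]

joint, mpg_marginal, hp_marginal = joint_and_marginals(points, x_width=10, y_width=50)
print(dict(joint))
print(dict(mpg_marginal), dict(hp_marginal))
```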

As before, JointGrid gives you a bit more control by letting you map the marginal and joint data separately. For example:

g = sns.JointGrid(x="horsepower", y="mpg", data=df)  
g.plot_joint(sns.regplot, order=2)  
g.plot_marginals(sns.distplot)

Exploratory analysis using R & ggplot2

Introduction

The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great-looking plots and impressive flexibility have made it popular amongst the R coding community. The syntax is rather different from that of other R graphics packages, allowing users to produce very creative plots with relatively small amounts of code.

In insurance, pricing models can become very complex and sometimes it is useful to have a tool like R to build graphs that are informative of the data structure. These can be useful not only in discussions within pricing teams but also when communicating ideas to non-technical people. Very often presenting the correct graph can save time.

The purpose of this post is to outline some exploratory plots that a pricing analyst might use when looking at data.

The Data

The data resides in the faraway R package and is called motorins. It contains claims data (Payment and perd), exposure data (Insured), number of claims (Claims) and some rating factors (Kilometres, Zone, Bonus, Make).

We first load the packages we need.

require(faraway) # the data source
require(ggplot2) # for plotting
require(gridExtra) # for arranging plots
require(scales) # for the plot scales

We can look at the data table:

head(motorins)
  Kilometres Zone Bonus Make Insured Claims Payment     perd
1          1    1     1    1  455.13    108  392491 3634.176
2          1    1     1    2   69.17     19   46221 2432.684
3          1    1     1    3   72.88     13   15694 1207.231
4          1    1     1    4 1292.39    124  422201 3404.847
5          1    1     1    5  191.01     40  119373 2984.325
6          1    1     1    6  477.66     57  170913 2998.474

Note that perd is the payment per claim (Payment/Claims).

We calculate the exposure-weighted claims frequency (AveCount) and rename perd to AvePaid.

claims <- motorins
names(claims)[8] <- "AvePaid"
claims$AveCount <- with(claims, Claims/Insured)
claims$Bonus <- factor(claims$Bonus, ordered = TRUE)
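
A quick arithmetic check of these definitions, sketched in Python (the language of the Seaborn post above) rather than R, using the first three rows printed by head(motorins). It recomputes the payment per claim to compare against the perd column and works out the claim frequency per unit of exposure.

```python
# First three rows of head(motorins), copied from the table above.
rows = [
    {"Insured": 455.13, "Claims": 108, "Payment": 392491, "perd": 3634.176},
    {"Insured": 69.17, "Claims": 19, "Payment": 46221, "perd": 2432.684},
    {"Insured": 72.88, "Claims": 13, "Payment": 15694, "perd": 1207.231},
]

for row in rows:
    # Severity: average payment per claim, which should reproduce perd.
    row["AvePaid"] = row["Payment"] / row["Claims"]
    # Frequency: claims per unit of exposure.
    row["AveCount"] = row["Claims"] / row["Insured"]
    print(round(row["AvePaid"], 3), row["perd"], round(row["AveCount"], 4))
```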

More information on the data can be obtained by using ?motorins in the R interpreter.

Box plots for frequency and severity

We can start by looking at one-way box plots for frequency and severity. First we look at Kilometres, a categorical variable: an ordered factor for distance driven each year. The first part of the code uses the qplot() function to create a box plot (geom = "boxplot") for the frequency. The second part repeats the task for the severity, and the last part simply arranges the two plots side by side in a 2-column layout. We also use the theme() function to position the legend.

# Frequency box-plot
fp <- qplot(data = claims, x = Kilometres, y = AveCount, fill = Kilometres, 
 geom = "boxplot", ylab = "Average Claims Count\n") + theme(legend.position="bottom")

# Severity box-plot
sp <- qplot(data = claims, x = Kilometres, y = AvePaid, fill = Kilometres, 
 geom = "boxplot", ylab = "Average Severity\n") + theme(legend.position="bottom")

# Arranging the plots
grid.arrange(fp, sp, ncol = 2)

Clearly some transformation is necessary, so we plot on a log y-scale:

# Scale transformation
fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                         labels = trans_format("log10", math_format(10^.x)))

sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                         labels = trans_format("log10", math_format(10^.x)))
grid.arrange(fp, sp, ncol = 2)

This is better. The trend in the data on the frequency side becomes clear.

We would want to see the mean as well. Here we use the geom_point() function to add the mean point to each box plot. Please note the use of the I() function: it tells ggplot2 to take a value as-is (here, a fixed size or colour) rather than treating it as a variable to be mapped to a scale.

# Adding the mean point to the box plot
fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
grid.arrange(fp, sp, ncol = 2)

For convenience we can wrap this up as the R function below. In the function below, we use the substitute(), deparse(), and eval() functions to mediate passing the rating factors that we are interested in into the qplot() function.

frSevBoxPlot <- function(rFactor = Kilometres, Data = claims){
  
  rFactor <- substitute(rFactor)
  
  fp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AveCount, fill = eval(rFactor, list(fill = rFactor)), 
              geom = "boxplot", ylab = "Average Claims Count\n", xlab = deparse(rFactor)) + 
              theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  sp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AvePaid, fill = eval(rFactor, list(fill = rFactor)),
              geom = "boxplot", ylab = "Average Severity\n", xlab = deparse(rFactor)) + 
    theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  
  grid.arrange(fp, sp, ncol = 2)
  
}

Now we can plot with impunity. The plots for the Zone, Bonus, and Make factors are all below.

frSevBoxPlot(rFactor = Zone)
frSevBoxPlot(rFactor = Bonus)
frSevBoxPlot(rFactor = Make)

From the above plots, claims frequency is far more interesting, so for the rest of this blog we will focus on that.

Two-way plots

There are lots of ways to look at the influence of two factors on a variable in ggplot2. One of these is to use the facet_grid() function. In the plot below, we look at the influence of Zone and Bonus on the claims frequency. Both trends are immediately clear. We have shifted to using the ggplot() function, which is a more formal way of defining the plots. In this case the geometries that we want to plot are defined as standalone functions, e.g. geom_boxplot().

# `path` is the output directory; it is not defined in this post, so set it before running
svg(filename = paste(path, "7_Zone_Bonus_Two_Way.svg", sep = ""), width = 11, height = 7)
ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_boxplot() + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
dev.off()

An alternative way of presenting this information is using the violin plot. The advantage here is that the shape of the distribution is immediately evident.

ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_violin(trim = FALSE) + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

Histogram

Now we move to histograms, which in this case are particularly interesting because we will be plotting histograms of claim frequencies. On the face of it this can be slightly confusing (the idea of frequencies of frequencies), but the plots are of real value. Firstly, we plot a basic histogram.

qplot(AveCount, data = claims, geom = "histogram", y = ..density.., 
      binwidth = .04, colour = I("white"), fill = I("orange"), xlab = "\nExposure weighted average Counts", 
      ylab = "Density", main = "Histogram of average claim counts\n")

For each rating factor, we can produce plots where each frequency bar is apportioned to rating categories. Not only this, we can also choose to represent each bar as proportions. Eyeballing the charts side by side, we can see where the claims are and which categories are represented time and again. We do this by setting position = "stack" to stack the bars together, and position = "fill" to show the proportion of each category within each bar.
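
The difference between the two positions comes down to one normalisation step. A minimal sketch, again in Python to match the earlier post and with hypothetical bin counts: the stacked heights are the raw per-category counts in each bin, and the filled version divides each of them by its bin total so that every bar sums to 1.

```python
def fill_proportions(stacked):
    """Rescale each bin's category counts so that they sum to 1."""
    filled = {}
    for bin_label, counts in stacked.items():
        total = sum(counts.values())
        filled[bin_label] = {cat: n / total for cat, n in counts.items()}
    return filled


# Hypothetical counts per histogram bin, split by rating category.
stacked = {
    "0.00-0.04": {"Zone 1": 6, "Zone 2": 2},
    "0.04-0.08": {"Zone 1": 3, "Zone 2": 9},
}

filled = fill_proportions(stacked)
for bin_label, shares in filled.items():
    print(bin_label, shares)
```

The stacked view shows where most of the observations sit, while the filled view shows which categories dominate each bin regardless of how tall the bar is.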

We wrap our plotting code into a function:

oneWayPlot <- function(rFactor = Kilometres, position = stack){
  
  rFactor <- substitute(rFactor)
  position <- deparse(substitute(position))
  
  switch(position, fill = {ylab = "Proportions"; main = ylab}, {ylab = "Density"; main = "Histogram"})
  ylab <- paste(ylab, "\n")
  main <- paste(main, "of claim counts by", rFactor)
  
  tplot <- qplot(AveCount, data = claims, geom = "histogram", y = ..density.., binwidth = .04, 
                 fill = eval(rFactor, list(fill = rFactor)), ylab = ylab, xlab = "\nExposure weighted average Counts",
                 main = paste(main, "\n"), position = position) + scale_fill_discrete(name = paste(deparse(rFactor), "  ")) + 
    theme(legend.position="bottom")
  
  return(tplot)
}

and then output the plots

# Kilometers
grid.arrange(oneWayPlot(Kilometres, stack), oneWayPlot(Kilometres, fill), ncol = 2)
# Zone
grid.arrange(oneWayPlot(Zone, stack), oneWayPlot(Zone, fill), ncol = 2)
# Bonus
grid.arrange(oneWayPlot(Bonus, stack), oneWayPlot(Bonus, fill), ncol = 2)
# Make
grid.arrange(oneWayPlot(Make, stack), oneWayPlot(Make, fill), ncol = 2)

Summary

Using ggplot2 can clearly produce very interesting and informative graphics. Few statistical programs have R's potential to change the way that analysis is presented and carried out in the insurance industry. I hope that this post will encourage more actuarial analysts to have a go at R.