Introduction:
Seaborn is a Python data visualization library with an emphasis on statistical plots. The library is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once.
In this post, we'll cover three of Seaborn's most useful functions: factorplot, pairplot, and jointplot. Going a step further, we'll show how we can get even more mileage out of these functions by stepping up to their even-more-powerful forms: FacetGrid, PairGrid, and JointGrid.
The Data
To showcase Seaborn, we'll use the UCI "Auto MPG" data set.
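We did a bit of preprocessing to get the data ready for analysis. The snippets below assume the result is loaded into a pandas DataFrame named df, with seaborn imported as sns and matplotlib.pyplot imported as plt. In Seaborn 0.9 and later, factorplot is available under the name catplot (pass kind="point" for the same style of plot).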
factorplot and FacetGrid
One of the most powerful features of Seaborn is the ability to easily build conditional plots; this lets us see what the data look like when segmented by one or more variables. The easiest way to do this is through factorplot. Let's say that we're interested in how cars' MPG has varied over time. Not only can we easily see this in aggregate:
sns.factorplot(data=df, x="model_year", y="mpg")
But we can also segment by, say, region of origin:
sns.factorplot(data=df, x="model_year", y="mpg", col="origin")
What's so great about factorplot is that rather than having to segment the data ourselves and make the conditional plots individually, Seaborn provides a convenient API for doing it all at once.
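To make "segmenting the data ourselves" concrete, here is a minimal pure-Python sketch of the bookkeeping that a call with col="origin" saves us: split the rows by origin, then average MPG per model year within each split. The rows and the helper name mean_mpg_by_segment are made up for illustration; this is not Seaborn's implementation, and it assumes the plotted point is a plain mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (origin, model_year, mpg) rows; the values are invented for
# illustration and are not taken from the Auto MPG data.
rows = [
    ("usa", 70, 18.0),
    ("usa", 70, 15.0),
    ("usa", 71, 19.0),
    ("japan", 70, 24.0),
    ("japan", 71, 27.0),
    ("japan", 71, 31.0),
]


def mean_mpg_by_segment(rows):
    """Return {origin: {model_year: mean mpg}}, one inner dict per panel."""
    grouped = defaultdict(lambda: defaultdict(list))
    for origin, year, mpg in rows:
        grouped[origin][year].append(mpg)
    return {
        origin: {year: mean(values) for year, values in sorted(by_year.items())}
        for origin, by_year in grouped.items()
    }


panels = mean_mpg_by_segment(rows)
for origin, series in panels.items():
    print(origin, series)
```

Each key of panels corresponds to one column of the conditional plot, and each inner dict holds the points that would be drawn in that panel.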
The FacetGrid object is a slightly more complex, but also more powerful, take on the same idea. Let's say that we wanted to see the MPG distributions (a histogram with a KDE curve on top), separated by country of origin:
g = sns.FacetGrid(df, col="origin")
g.map(sns.distplot, "mpg")
Or let's say that we wanted to see scatter plots of MPG against horsepower with the same origin segmentation:
g = sns.FacetGrid(df, col="origin")
g.map(plt.scatter, "horsepower", "mpg")
Using FacetGrid, we can map any plotting function onto each segment of our data. For example, above we gave plt.scatter to g.map, which tells Seaborn to apply the matplotlib plt.scatter function to each of the segments in our data. We don't need to use plt.scatter, though; we can use any function that understands the input data. For example, we could draw regression plots instead:
g = sns.FacetGrid(df, col="origin")
g.map(sns.regplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
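The idea behind g.map can be sketched without any plotting at all: split the rows by the faceting column, then hand each segment's x and y values to whatever function was supplied. The helper names facet_map and slope below are hypothetical, and the rows are invented; a least-squares slope stands in for "draw a regression plot". This is only a rough model of the idea, not how Seaborn is written.

```python
from collections import defaultdict


def facet_map(rows, col, func, x, y):
    """Call func(xs, ys) once per distinct value of row[col]."""
    segments = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = segments[row[col]]
        xs.append(row[x])
        ys.append(row[y])
    return {key: func(xs, ys) for key, (xs, ys) in segments.items()}


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx


# Hypothetical rows, invented for illustration.
rows = [
    {"origin": "usa", "horsepower": 100, "mpg": 25.0},
    {"origin": "usa", "horsepower": 150, "mpg": 20.0},
    {"origin": "usa", "horsepower": 200, "mpg": 15.0},
    {"origin": "europe", "horsepower": 60, "mpg": 34.0},
    {"origin": "europe", "horsepower": 80, "mpg": 30.0},
    {"origin": "europe", "horsepower": 100, "mpg": 26.0},
]

slopes = facet_map(rows, "origin", slope, "horsepower", "mpg")
print(slopes)
```

Swapping slope for any other function of (xs, ys) changes what happens in every segment at once, which is the same freedom g.map gives us with plotting functions.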
We can even segment by multiple variables at once, spreading some along the rows and some along the columns. This is very useful for comparing conditional distributions across interacting segmentations:
df['tons'] = (df.weight/2000).astype(int)
g = sns.FacetGrid(df, col="origin", row="tons")
g.map(sns.kdeplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
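It may help to see what the tons column does before it is used as a row variable. Assuming weight is in pounds, dividing by 2000 and casting to an integer truncates each weight to whole tons, and every (tons, origin) pair that occurs in the data becomes one cell of the grid. The sketch below reproduces that binning in plain Python on invented rows; whole_tons is a hypothetical helper, not part of Seaborn or pandas.

```python
from collections import defaultdict

# Hypothetical (origin, weight in pounds) rows, invented for illustration.
cars = [
    ("usa", 4100),
    ("usa", 3999),
    ("japan", 1800),
    ("japan", 2000),
    ("europe", 2300),
]


def whole_tons(weight_lb):
    """Truncate a weight in pounds to whole tons, like (w / 2000).astype(int)."""
    return int(weight_lb / 2000)


# Each (tons, origin) key corresponds to one (row, col) cell of the grid.
cells = defaultdict(list)
for origin, weight in cars:
    cells[(whole_tons(weight), origin)].append(weight)

for key in sorted(cells):
    print(key, cells[key])
```

Note that truncation puts a 3999-pound car in the 1-ton row rather than the 2-ton row, so the rows are "at least this many tons" bins.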
pairplot and PairGrid
While factorplot and FacetGrid are for drawing conditional plots of segmented data, pairplot and PairGrid are for showing the interactions between variables. For our car data set, we know that MPG, horsepower, and weight are probably going to be related; we also know that both the values of these variables and their relationships with one another might vary by country of origin. Let's visualize all of that at once:
g = sns.pairplot(df[["mpg", "horsepower", "weight", "origin"]], hue="origin", diag_kind="hist")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
Just as FacetGrid was a fuller version of factorplot, PairGrid gives a bit more freedom on the same idea as pairplot by letting you control the individual plot types separately. Let's say, for example, that we're building regression plots, and we'd like to see both the original data and the residuals at once. PairGrid makes it easy:
g = sns.PairGrid(df[["mpg", "horsepower", "weight", "origin"]], hue="origin")
g.map_upper(sns.regplot)
g.map_lower(sns.residplot)
g.map_diag(plt.hist)
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
g.add_legend()
g.set(alpha=0.5)
We were able to control three regions (the diagonal, the lower-left triangle, and the upper-right triangle) separately. Again, you can pipe in any plotting function that understands the data it's given.
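As a rough illustration of the three regions, the sketch below labels every cell of a grid built from the three numeric variables. It assumes the usual layout in which the row picks the y variable and the column picks the x variable, so cells above the diagonal are "upper" and cells below it are "lower". The helper pairgrid_regions is hypothetical and exists only for this example.

```python
def pairgrid_regions(variables):
    """Map each (row_var, col_var) cell to 'diag', 'upper', or 'lower'."""
    regions = {}
    for i, row_var in enumerate(variables):
        for j, col_var in enumerate(variables):
            if i == j:
                regions[(row_var, col_var)] = "diag"
            elif i < j:
                regions[(row_var, col_var)] = "upper"
            else:
                regions[(row_var, col_var)] = "lower"
    return regions


regions = pairgrid_regions(["mpg", "horsepower", "weight"])
for cell, region in regions.items():
    print(cell, region)
```

With three variables this gives three cells per region, which matches the three groups of panels that map_diag, map_upper, and map_lower each fill in.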
jointplot and JointGrid
The final Seaborn objects we'll talk about are jointplot and JointGrid; these features let you easily view both a joint distribution and its marginals at once. Let's say, for example, that aside from being interested in how MPG and horsepower are distributed individually, we're also interested in their joint distribution:
sns.jointplot(x="mpg", y="horsepower", data=df, kind="kde")
As before, JointGrid gives you a bit more control by letting you map the marginal and joint data separately. For example:
g = sns.JointGrid(x="horsepower", y="mpg", data=df)
g.plot_joint(sns.regplot, order=2)
g.plot_marginals(sns.distplot)
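For readers who want the relationship between the joint and marginal views spelled out, here is a small pure-Python sketch using coarse bins instead of smooth densities. The pairs are invented, and bin_of with its bin widths is an arbitrary choice made for this example. The point is only that each marginal is the joint count summed over the other variable.

```python
from collections import Counter

# Hypothetical (horsepower, mpg) pairs, invented for illustration.
pairs = [(95, 24.0), (150, 16.0), (88, 27.0), (165, 15.0), (110, 21.5), (70, 32.0)]


def bin_of(value, width):
    """Left edge of the fixed-width bin that contains value."""
    return int(value // width) * width


# Joint distribution: counts per (horsepower bin, mpg bin) cell.
joint = Counter((bin_of(hp, 50), bin_of(mpg, 10)) for hp, mpg in pairs)

# Marginals: sum the joint counts over the other variable.
hp_marginal = Counter()
mpg_marginal = Counter()
for (hp_bin, mpg_bin), count in joint.items():
    hp_marginal[hp_bin] += count
    mpg_marginal[mpg_bin] += count

print(dict(joint))
print(dict(hp_marginal))
print(dict(mpg_marginal))
```

The central panel of a joint plot plays the role of joint here, and the two side panels play the roles of hp_marginal and mpg_marginal.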