Friday, 5 August 2016

U Churn, I catch.... With Regards, Python

      It's always a heartbreak for a company when a customer churns. It not only hurts the company's revenue but also raises questions about the viability of its products in the market, leaving the company with unanswered questions. At times it is better to identify the customers who are likely to churn well in advance rather than ignore the small group, as it may grow into something much bigger. This not only helps the company win back churned customers, but the insights gained along the way also help the company understand customer behavior.
   And who brings out such insights? Oh yes, the analytics folks, who toil through numbers and extract exciting information from the underlying data.
    In this blog, we are going to discuss the analysis and prediction of churned customers from a telecom dataset. We will perform predictive modelling using a Random Forest classifier that distinguishes the churned customers from the rest of the active customers.

Flow of the modelling process:

  • Preparation of dataset.
  • Exploratory Data Analysis.
  • Preprocessing and Feature Engineering.
  • Model Cross Validation.
  • Analyzing the model performance.
  • Prediction using Random Forest classifier.
Preparation of dataset:
           The dataset we will use for our modelling comes from the telecom industry. Our first step is to import the libraries that are useful for the data analysis; the ones we use here are pandas and NumPy. The next step is to load the data into the Jupyter Notebook. We get a first grasp of the data by viewing the first 5 rows of the dataset.
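A minimal sketch of this step, assuming the data sits in a local CSV file (the file name churn.csv is an assumption):

    import pandas as pd
    import numpy as np

    # Load the raw telecom data (file name is an assumption)
    df = pd.read_csv('churn.csv')

    # First grasp of the data: the first 5 rows
    df.head()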
















We will also be interested in checking the structure of the data, i.e. the dimensions of the dataset, as this can be influential at times in the selection of the modelling technique. We use the shape attribute to find the dataset dimensions.
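For example:

    # (number of rows, number of columns)
    df.shape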





From the result we came to know the dimensions of the data, but we also want to know which variables are involved in the dataset in order to proceed.
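A quick way to list the variables and their data types:

    # Column names, data types and non-null counts
    df.info()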









Now we have enough information about the dataset, which we can correlate with our domain knowledge to understand the importance of each predictor. We can also see that a lot of the variable names have unwanted apostrophes and spaces, which causes some discomfort when we use them for analysis. We will now rename all the variables to something more comfortable, using the rename function that can be applied to the pandas DataFrame.
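A sketch of the renaming step; the raw column names on the left are assumptions about the source file, and only a few of the mappings are shown:

    # Map the awkward raw names to cleaner ones
    df = df.rename(columns={"Int'l Plan": 'IntlPlan',
                            'VMail Plan': 'VMPlan',
                            'VMail Message': 'VMailMessage',
                            'Day Mins': 'DayMins',
                            'Day Calls': 'DayCalls',
                            'Day Charge': 'DayCharge',
                            'CustServ Calls': 'CustServCalls',
                            'Churn?': 'Churn'})
    # ... the remaining columns are renamed in the same way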



















   After renaming the columns, we can use the head function to get a glimpse of a few rows of the cleaned dataset.

















 And the result looks far better than the uncleaned dataset we initially imported.

From the initial observation of the dataset we also find that there are different data types present: numerical and categorical.

We particularly analyze three variables, IntlPlan, VMPlan and Churn, which are the categorical variables in the dataset.



 Since these variables are categorical in nature, we change them into numerical values using dummy-variable coding. Dummy variables are numerical codes that represent the categorical values so that the models can work with them.
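One way to sketch this conversion; the raw text values ('yes'/'no' for the plans and 'True.'/'False.' for Churn) are assumptions about the source file:

    # Convert the yes/no and True./False. text values into 0/1 codes
    df['IntlPlan'] = np.where(df['IntlPlan'] == 'yes', 1, 0)
    df['VMPlan'] = np.where(df['VMPlan'] == 'yes', 1, 0)
    df['Churn'] = np.where(df['Churn'] == 'True.', 1, 0)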















 We can observe that the text values are converted into numerical codes of 0s and 1s respectively.

The data preparation is done with care so that we don't miss any important predictors that will be helpful in building the model.


Exploratory Data Analysis:

           Exploratory data analysis is one of the important steps for getting information about the data, which will be very useful in our model building. At times it helps to confirm some of the assumptions that can be made about the data. It is always advisable to analyze each feature before getting into the modelling part.
   We use the describe function on the pandas DataFrame to get information about the variables involved. Information such as the count, mean, standard deviation, min, max and quartiles can be easily obtained for each variable using the describe function.
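For example:

    # Summary statistics for every numerical variable
    df.describe()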














From the summary of the Churn variable we can infer that, on average, 14.4% of the customers churn.

The next thing we can look at is churn with respect to another variable. For example, do people with an International Plan (IntlPlan) have a higher average churn than people without one? To get these insights we will use two functions. The first one is a function named crosstab that you can use within pandas to count the number of instances or samples based on two dimensions. In this case we want to see how many people have an international plan and how many churn.
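A sketch of that count:

    # Number of customers per combination of international plan and churn
    pd.crosstab(df['IntlPlan'], df['Churn'])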





    More useful information can be obtained using the groupby function on the pandas DataFrame. This function groups the data by one feature and then applies aggregate functions to the other features.
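For example, the average churn rate per group:

    # Average churn rate for customers with and without an international plan
    df.groupby('IntlPlan')['Churn'].mean()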







We can immediately see there is a clear difference in the average churn rate between people with and without an international plan (42.4% versus 11.5%).
         A more structured format of the above operation is a pivot table. We can create the same type of overview using the pandas function pivot_table. You need to specify which values you want to calculate your functions on and which dimensions to split the data on. To get some more insight we will now create a pivot table for both the features IntlPlan and VMPlan.
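A sketch of that pivot table:

    # Average churn rate split by international plan and voicemail plan
    pd.pivot_table(df, values='Churn',
                   index='IntlPlan', columns='VMPlan',
                   aggfunc='mean')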




        It would not be a great data analysis if we missed giving some treat to our eyes through data visualization. Matplotlib and Seaborn are the two colorful stars of our show.
With the pivot_table and groupby functions we can get the numerical information, and if we want that information explained visually we can use the data visualization libraries above.
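As a small sketch, the churn rates per international plan can be drawn as a bar chart:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Average churn rate with and without an international plan
    sns.barplot(x='IntlPlan', y='Churn', data=df)
    plt.ylabel('Average churn rate')
    plt.show()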












     It makes sense to look at how the distribution of the data differs for the people that churn and the people that don’t churn. We can do this in two ways.

Firstly, if a variable has a small number of different values/categories, we could create a similar bar chart (factorplot) for each of those values, representing the average churn rate for that specific value. We'll try this for the CustServCalls variable, which has a range from 0 to 9.
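The numbers behind that chart can be obtained with a quick groupby:

    # Average churn rate per number of customer service calls
    df.groupby('CustServCalls')['Churn'].mean()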




The above information can be visually described using the barplot function.
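A sketch of that chart with seaborn:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Average churn rate per number of customer service calls
    sns.barplot(x='CustServCalls', y='Churn', data=df)
    plt.ylabel('Average churn rate')
    plt.show()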





        This method works well when a feature has a limited range of values and when those values are quite discrete. When dealing with larger ranges and more continuous features, these graphs quickly become less informative. This is where it makes more sense to look at the distribution or kernel density of a variable. For these visualizations we will use the violinplot from the seaborn library.
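A sketch for one of these features:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Distribution of day-time minutes for churners versus non-churners
    sns.violinplot(x='Churn', y='DayMins', data=df)
    plt.show()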





From these graphs the most interesting variables may be DayMins and DayCharge. Obviously these two are related, since the charge is most likely determined by the minutes used and some tariff. From the graphs above it is interesting to note that, for the churners, they seem to have a bimodal distribution with a second, higher peak. This may indicate that very active users are more likely to churn, perhaps because they are always looking for a better package/deal, or because they are more affected by poor service or unmet expectations.





Preprocessing and Feature Engineering:
           
One of the most important stages in data analysis is probably the pre-processing and feature engineering stage. At this stage you will consider how you can use existing information to extract important features for prediction. This is often considered more of an art than a science and it usually requires some good understanding of the data, the domain/industry and the problem:

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng
Pre-processing essentially focuses on two problems: (1) data quality and (2) data representation. The first focuses on the quality of your dataset and potential problems with errors in the data itself. This often results in discarding data or adding values based on imputation or interpolation. The second issue deals with the question of how you transform your data such that it will work best with your algorithms.
When looking at all the other features we have three different sets of features around the usage per part of the day (day, eve, night). We know how many minutes people have used their phone, how many calls they have made and how much they were charged for that activity. The interesting part is that two of those variables seem to depend on another one of the variables.
  • You can argue that the number of minutes is most likely related to the number of calls made. Calls say something about activity, but the minutes per call also give an indication of activity. So I decided to create a new variable that has the average minutes per call.
  • The charge is most likely a function of the number of minutes and some tariff applied to that. So it makes sense to create a feature that represents the tariff people were charged at (see the sketch after this list).
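A sketch of these two derived features for the day-time columns; the same pattern would repeat for the evening and night columns (the column names follow the renaming assumed earlier):

    # Average minutes per call and the effective tariff for day-time usage
    df['DayMinsPerCall'] = df['DayMins'] / df['DayCalls']
    df['DayTariff'] = df['DayCharge'] / df['DayMins']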




This code will nicely generate the new features, but it will also insert NaNs due to dividing by zero. To deal with this we have to make sure to set all of those NaN values to zero.
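A minimal way to do that (also catching the infinities that a float division by zero produces):

    # Reset the NaN/inf values created by the divisions above to zero
    df = df.replace([np.inf, -np.inf], np.nan).fillna(0)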





We have an existing variable that holds the number of voicemail messages. Since it contains the value 0 for some samples, it can be hard for some algorithms to assign any importance to it, as they would be multiplying by 0. It can be wise to create a new variable that tells us whether the number of voicemail messages is equal to zero. We'll label this variable NoVMMessages and it will have a value of 1 if there are no voicemail messages and 0 otherwise.
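A sketch of that flag (VMailMessage is the assumed name of the message-count column):

    # 1 if the customer has no voicemail messages at all, 0 otherwise
    df['NoVMMessages'] = (df['VMailMessage'] == 0).astype(int)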



Because we now have the latitude and longitude for each state, we can also drop the column that holds the state code.
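Assuming the state code lives in a column named State, that drop could look like:

    # The state code is no longer needed once latitude/longitude are available
    df = df.drop('State', axis=1)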



We now have all the variables we need to work with. To make it easier to use in some models later on we’ll now convert the Pandas DataFrame to a Numpy matrix. We’ll create a vector y holding the churn-labels and a matrix X with all the explanatory variables.
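A sketch of that split, assuming any remaining identifier columns (such as a phone number) have already been dropped:

    # Target vector and feature matrix as NumPy arrays
    y = df['Churn'].values
    X = df.drop('Churn', axis=1).values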




This gives us a matrix to work with. The big problem now is that a lot of the values are in completely different ranges. Remember that features like NoVMMessages are 0 or 1, whereas a feature like IntlMins covers a much wider numeric range. Such differences in ranges between features may cause problems in some models. Models that depend on calculating some distance between samples can be thrown off by the scaling of the individual features. To account for this we are going to rescale all features and standardize them. This boils down to subtracting their individual means and dividing by the individual standard deviations.
For this we can use the Scikit-Learn library and its StandardScaler. The documentation for this function reiterates the previous point, stating that:
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
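A sketch of the scaling step:

    from sklearn.preprocessing import StandardScaler

    # Rescale every feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)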



Cross-Validation

Before trying to predict customer churn using any model we need to give our cross-validation framework some thought. Cross-validation is defined by Wikipedia as
a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
It serves the purpose of testing how well the model performs on data it has not seen before. It is a great way to get a feel for (a) how much overfitting is going on in the training of the model and (b) how robust the model is when facing uncertainty in terms of new data it encounters.
One way to approach cross-validation is by cutting the dataset up into a number of chunks (also known as folds). You then take the first chunk and keep it to the side. You take all the remaining data and train your model using that data. You then use your trained model to predict the chunk that you initially kept out. These predictions give you a feel for how well your model can deal with data it has not seen before. Now repeat this process for each chunk and at the end you will have predictions for every datapoint in your dataset.
There are a number of different techniques for cross-validation and each problem will probably justify its own cross-validation procedure. To not overcomplicate things we’ll stick with a very basic approach here that splits the dataset up in a number of blocks. Scikit-Learn has a great toolkit for cross-validation that we can draw from.
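A minimal sketch of such a framework; the helper name run_cv and the 5-fold split are assumptions, and the KFold import below uses the sklearn.cross_validation module that was current at the time of this post (newer scikit-learn versions provide KFold in sklearn.model_selection with a slightly different signature):

    from sklearn.cross_validation import KFold

    def run_cv(X, y, clf):
        # Out-of-fold class predictions and churn probabilities for every sample
        kf = KFold(len(y), n_folds=5, shuffle=True, random_state=42)
        y_pred = y.copy()
        y_prob = np.zeros(len(y), dtype=float)
        for train_idx, test_idx in kf:
            clf.fit(X[train_idx], y[train_idx])
            y_pred[test_idx] = clf.predict(X[test_idx])
            y_prob[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
        return y_pred, y_prob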





We now have our cross-validation framework. Once given the data and a classifier, it will predict for each user whether they are going to churn, along with the probability that the classification is based on.

Performance:
         We could simply calculate how many of the users we have predicted correctly: the accuracy. We could also look at how many of the actual churners we have correctly predicted, as those are the customers that we don't want to lose! This measure is usually referred to as the recall of the model. Here again Scikit-Learn has a range of performance metrics built in.
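A small helper along those lines (the function name print_performance is an assumption):

    from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

    def print_performance(y_true, y_pred):
        # Overall accuracy, recall on the churners, and the confusion matrix
        print('Accuracy: %.4f' % accuracy_score(y_true, y_pred))
        print('Recall  : %.4f' % recall_score(y_true, y_pred))
        print(confusion_matrix(y_true, y_pred))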




Prediction Using Random Forest Classifier:
For our predictive model we will use the Scikit-Learn implementation of a Random Forest classifier. This model gives us some parameters we can use to increase the size and depth of the model. What we need to do next is take the cross-validation framework we built previously and give it the data and the model of our choice.
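A sketch of that step, reusing the hypothetical run_cv and print_performance helpers from above (n_estimators=10 was the scikit-learn default at the time):

    from sklearn.ensemble import RandomForestClassifier

    # Baseline random forest evaluated with the cross-validation framework
    clf = RandomForestClassifier(n_estimators=10, random_state=42)
    y_pred, y_prob = run_cv(X, y, clf)
    print_performance(y, y_pred)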











This is already a great result. It accurately predicts 94.2% of all customers! However, most of those are not churning and are therefore less interesting. If we look at the confusion matrix that we created of the predictions versus the actual results, we can see that we only correctly pick out 307 churning customers, but fail to identify the other 176 customers who do churn as well. This is correctly represented by the recall metric, which shows us we are only 63.56% accurate on the customers we actually want to be most accurate on.

Let's take a look and see how the model performs if we increase the number of estimators (the size) of the model.
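For example, with a larger ensemble (the value 100 is an arbitrary choice for illustration):

    # Same model with more trees in the forest
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    y_pred, y_prob = run_cv(X, y, clf)
    print_performance(y, y_pred)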















Wednesday, 3 August 2016

Keep Calm n Lets do Baseball Analytics- Sabermetrics- Python

         When it comes to data analytics, I keep looking for a domain where it is not applied rather than where it is applied. And my search seems to be a never-ending while-loop, as the condition of finding a domain where analytics doesn't play a role is nowhere in sight. Such was my astonishment when one of my professors told me about the application of analytics in baseball. I pondered it with some bewildered thoughts and murmured, at such a low decibel that it was barely heard even by me, "Is he kidding?" It's a game played with a bat and a ball, the performance of a player depends upon his ability, and there is no way one can predict which player is going to rock the field at what time.
          Unconvinced and curious about the case, I found that it is indeed possible. Awestruck, I went through the blog provided by my tech-savvy professor on how analytics solves the mystery of analyzing player performance through machine learning concepts using the powerful programming tool "PYTHON" (and yeah, that's the favorite pet in the world of analytics at present).
          Before jumping into baseball analytics, I want to talk about something called SABERMETRICS (I hope you are as terrified by the word as I was). It's nothing but a fashionable way of saying baseball analytics (yeah, people always love to hear complex things). To throw some spotlight on our star "Sabermetrics", let me get some help from the guy who knows almost everything (oops, I am not a male chauvinist, please excuse me), the great WIKIPEDIA, which defines it as follows:

      Sabermetrics is the empirical analysis of baseball, especially baseball statistics that measure in-game activity. It answers almost every question pertaining to baseball, from approximately how much a team can score to which player a team should select next to have a better chance of getting the desired result.

    Google is a far better place to learn more about Sabermetrics (but let me not let myself down; I have given some introduction and am going to do a step-by-step analysis using Python). Here we are going to discuss the famous case of sabermetrics involving the Oakland Athletics, managed by Billy Beane, following the analytical strategies designed by the mastermind Paul DePodesta. To simplify things, Paul was the guy who used sabermetrics techniques to analyze players and work out how the team had to perform to get the desired result.

Let's dive into the coding... It's time for action... Let the game begin..

1. Data Intro:
       This dataset is a collection of batting and pitching statistics from 1871 to 2013 (source: Lahman Baseball). We will play with the data using Python 2.7, and the libraries that will help us in the analysis are NumPy, SciPy, Pandas, Matplotlib and Statsmodels. Let us take a small peek at each library:

Numpy- NumPy is an open source extension module for Python. The module NumPy provides fast precompiled functions for numerical routines.
It adds support to Python for large, multi-dimensional arrays and matrices. Besides that it supplies a large library of high-level mathematical functions to operate on these arrays.

Scipy-SciPy is widely used in scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

Pandas-Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.

Matplotlib- matplotlib is a plotting library for NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+.

Statsmodels- Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

2. Getting the data into Jupyter Notebook:
         We initially import all the libraries in a single cell and alias them as per our convenience (because, obviously, a bit lazy; hope no one heard).
We import the Teams.csv file using the pandas library and then start our further analysis. The dataset contains 2745 observations and 48 attributes.
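A minimal sketch of this step (the local file path is an assumption):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Team-level batting and pitching statistics from the Lahman database
    teams = pd.read_csv('Teams.csv')
    teams.shape   # (2745, 48) according to the post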



















3. Let's analyze precisely. Yeah, subset:
       After going through all the attributes (impatiently) and coming to a conclusion about the 15 important attributes that add value to the model, a subset of the dataset is made. And yeah, it only contains data from after the year 1985.
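A sketch of this subsetting step; the 15 Lahman column names kept here are assumptions based on the statistics used later in the post:

    # Keep the seasons from 1985 onwards and the attributes used in the analysis
    teams = teams[teams['yearID'] >= 1985]
    teams = teams[['yearID', 'teamID', 'Rank', 'R', 'RA', 'G', 'W',
                   'H', 'BB', 'HBP', 'AB', 'SF', 'HR', '2B', '3B']]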




The dataframe is then indexed using the yearID and the teamID.
It is just a step for easier access to the dataframe.
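For example:

    # Index the frame by season and team for easy lookups
    teams = teams.set_index(['yearID', 'teamID'])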








The salary.csv file is imported into the Jupyter Notebook using the pandas library.
This data frame contains the salaries of players from 1985 till 2013.
Using this dataset, the payroll is calculated.
We can calculate the payroll for a particular team in a particular year.
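A sketch of this step, assuming the salary file has yearID, teamID and salary columns (the file name is taken from the post):

    # Player salaries from 1985 to 2013
    salaries = pd.read_csv('salary.csv')

    # Total payroll per team per season
    payroll = salaries.groupby(['yearID', 'teamID'])['salary'].sum()

    # Example: the payroll of the Oakland Athletics (teamID 'OAK') in 2001
    payroll.loc[(2001, 'OAK')]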







The next step involves merging the salary dataset with the team dataset to get an overall view of the much-needed information. The payroll data is now stored in a column called salary. Now we can check the payroll of the Oakland Athletics in 2001.
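A sketch of the merge, reusing the payroll series from the previous snippet:

    # Attach the payroll to the team statistics as a 'salary' column
    teams = teams.join(payroll)

    # Payroll of the Oakland Athletics in 2001
    teams['salary'].loc[(2001, 'OAK')]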




Let's get ourselves busy generating some plots: a plot showing the relationship between salaries and the number of wins for the year 2001. The colorful matplotlib comes in handy to generate such plots.


















The following two functions are used to plot the relationship between salaries and wins, with labels and axis formatting, as well as highlighting the Oakland Athletics, the New York Yankees and the Boston Red Sox data.
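A rough sketch that combines the same idea into a single function (the function name, styling and team highlighting details are assumptions, not the post's exact code):

    def plot_wins_vs_salary(teams, year):
        # Payroll versus wins for one season, highlighting three teams
        season = teams.xs(year, level='yearID')
        plt.scatter(season['salary'] / 1e6, season['W'], c='grey', alpha=0.5)
        for team_id, colour in [('OAK', 'green'), ('NYA', 'blue'), ('BOS', 'red')]:
            row = season.loc[team_id]
            plt.scatter(row['salary'] / 1e6, row['W'], c=colour, s=80, label=team_id)
        plt.xlabel('Payroll (millions of dollars)')
        plt.ylabel('Number of wins in %d' % year)
        plt.legend(loc='lower right')
        plt.show()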












We then run the plot function we built, passing it two arguments, the teams data and the year, and look at the relationship between the salaries and the number of wins. This can help show how the amount of money invested in players relates to the desired result.
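Using the hypothetical plot_wins_vs_salary sketch above:

    # Relationship between payroll and wins for the 2001 season
    plot_wins_vs_salary(teams, 2001)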

















From the plot we get an interesting result: the Oakland Athletics register a high number of wins with lower-paid players, outperforming many teams that invest heavily in their players. Interesting!

4. Billy Beane's Formula

         For a baseball team to win a game, it needs to score more runs than it allows. In the remainder of this tutorial, we will build a mathematical model for runs scored. Similar logic could be applied to modelling runs allowed.
Most teams focused on Batting Average (BA) as the statistic to improve their runs scored. Billy Beane took a different approach: he focused on improving On-Base Percentage (OBP) and Slugging Percentage (SLG).
The Batting Average is defined by the number of hits divided by at bats. It can be calculated using the formula below:
BA = H/AB
On-base Percentage is a measure of how often a batter reaches base for any reason other than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference. It can be calculated using the formula below:
OBP = (H+BB+HBP)/(AB+BB+HBP+SF)
Slugging Percentage is a measure of the power of a hitter. It can be calculated using the formula below:
SLG = (H + 2B + (2*3B) + (3*HR)) / AB
(referred from Billy Beane's Formula)
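A sketch of these three statistics using the Lahman column names from the earlier subset:

    # Batting Average, On-Base Percentage and Slugging Percentage per the
    # formulas above
    teams['BA'] = teams['H'] / teams['AB']
    teams['OBP'] = (teams['H'] + teams['BB'] + teams['HBP']) / \
                   (teams['AB'] + teams['BB'] + teams['HBP'] + teams['SF'])
    teams['SLG'] = (teams['H'] + teams['2B'] + 2 * teams['3B']
                    + 3 * teams['HR']) / teams['AB']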




          The above three variables are added to the dataframe and will be very helpful in our model. We then build a linear regression model to predict the runs scored by a team, which is one of the most important parameters in assessing team performance.
         Here we build 3 different models: one with BA, OBP and SLG, a second with OBP and SLG, and a third with the feature BA only. We use the exciting statsmodels library to build the regression models.
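A sketch of the three models using statsmodels' formula API (the model variable names are assumptions):

    import statsmodels.formula.api as smf

    # Runs scored modelled on different combinations of the batting statistics
    runs_model1 = smf.ols(formula='R ~ BA + OBP + SLG', data=teams).fit()
    runs_model2 = smf.ols(formula='R ~ OBP + SLG', data=teams).fit()
    runs_model3 = smf.ols(formula='R ~ BA', data=teams).fit()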










We can look at a summary statistic of these models by running:
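For example, with the hypothetical model names from the sketch above:

    # Full regression output for each model
    print(runs_model1.summary())
    print(runs_model2.summary())
    print(runs_model3.summary())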


























The first model has an Adjusted R-squared of 0.918, but the coefficient on BA comes out negative at the 95% confidence level.
This is counterintuitive, since we expect the effect of BA to be positive. It is due to multicollinearity (a high correlation existing between attributes) among the variables.
The second model has an Adjusted R-squared of 0.919, and the last model an Adjusted R-squared of 0.500. The second model also has the lowest AIC and BIC compared to the other models, which gives a clear indication of which model gives the more accurate result.

Based on this analysis, we can conclude that the second model, using OBP and SLG, is the best model for predicting runs scored.
      Based on the analysis above, a good strategy for recruiting batters would be to target undervalued players with high OBP and SLG. In the late 1990s, the old-school scouts overvalued BA, and players with a high BA had high salaries. Although BA and OBP have a positive correlation, there were some players with high OBP and SLG and a relatively low BA. These players were undervalued by the market, and they were the target of Billy Beane.
     

Reference:

http://adilmoujahid.com/posts/2014/07/baseball-analytics/
http://www.datasciencecentral.com/profiles/blogs/9-python-analytics-libraries-1