It's always a heartbreak for a company when a customer churns. It not only hurts the company's revenue but also raises questions about the viability of its products in the market. It is often better to identify customers who are likely to churn well in advance rather than ignore a small group that may grow into a much bigger problem. This not only helps the company win back churned customers, but the insights gained along the way also help it understand customer behaviour.
And who brings out such insights? The analytics folks, of course, who toil through the numbers and extract exciting information from the underlying data.
In this blog, we are going to discuss the analysis and prediction of customer churn using a telecom dataset. We will build a predictive model using a Random Forest classifier that distinguishes churned customers from the rest of the active customers.
Flow of the modelling process:
- Preparation of dataset.
- Exploratory Data Analysis.
- Preprocessing and Feature Engineering.
- Model Cross Validation.
- Analyzing the model performance.
- Prediction using Random Forest classifier.
Preparation of dataset:
The dataset we will use for our modelling comes from the telecom industry. Our first step is to import the libraries that are useful for data analysis; here we use pandas and NumPy. The next step is to import the data into the Jupyter Notebook. We get a first grasp of the data by viewing the first 5 rows of the dataset.
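A minimal sketch of this step might look as follows (the file name is an assumption, so adjust it to wherever your copy of the dataset lives):

```python
import pandas as pd
import numpy as np

# Load the telecom churn dataset (file name is assumed here)
df = pd.read_csv('telecom_churn.csv')

# Peek at the first 5 rows to get a feel for the data
df.head()
```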
We are also interested in checking the structure of the data, i.e. the dimensions of the dataset, since this can influence the choice of modelling technique. We use the shape attribute to find the dataset's dimensions.
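For example:

```python
# Dimensions of the dataset as (number of rows, number of columns)
df.shape
```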
From the result we know the dimensions of the data, but we also want to know which variables are involved in the dataset before we proceed.
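Listing the columns together with their data types is a quick way to do this:

```python
# Column names and their data types
df.dtypes
```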
Now we have enough information about the dataset to correlate it with our domain knowledge and understand the importance of each predictor. We can also see that a lot of the variable names have unwanted apostrophes, which makes them awkward to use in the analysis. We will therefore rename all the variables to something more convenient, using the rename function that can be applied to a pandas DataFrame.
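A sketch of the renaming step is shown below; the exact column names and the mapping are assumptions, so adapt them to the names in your copy of the dataset:

```python
# Remove stray apostrophes and whitespace from the column names,
# then shorten a few of them (illustrative mapping only)
df = df.rename(columns=lambda c: c.strip().replace("'", ""))
df = df.rename(columns={'Intl Plan': 'IntlPlan',
                        'VMail Plan': 'VMPlan',
                        'CustServ Calls': 'CustServCalls'})
```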
After renaming the columns, we can use the head function to get a glimpse of a few rows of the cleaned dataset.
And the result looks far better than the raw dataset we initially imported.
From the initial observation of the dataset we also find that there are different data types, numerical and categorical that are present in the dataset.
We particularly analyze three variables, IntlPlan, VMPlan and Churn, which are the categorical variables in the dataset.
Since these variables are categorical in nature, we convert them into numerical values using dummy variable coding. Dummy variables are numerical codes (typically 0 and 1) that represent the categories of a categorical variable so that models can work with them.
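A minimal sketch of this conversion, assuming the plan columns hold yes/no style text and the churn column holds True/False style text (check the unique values in your data first):

```python
# Code yes/no plans as 1/0
for col in ['IntlPlan', 'VMPlan']:
    df[col] = np.where(df[col].str.strip() == 'yes', 1, 0)

# Code the churn label as 1 (churned) / 0 (active)
df['Churn'] = np.where(df['Churn'].astype(str).str.startswith('True'), 1, 0)
```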
We can observe that the text values are converted into numerical codes, 0s and 1s respectively.
The data preparation is done with care so that we don't miss any important predictors that will be helpful in building the model.
Exploratory Data Analysis:
Exploratory data analysis is one of the important steps for gaining information about the data that will be very useful in model building. It can also help to validate some of the assumptions that can be made about the given data. It is always advisable to analyse each feature before getting into the modelling part.
We use the describe function on the pandas DataFrame to get information about the variables involved: the count, mean, standard deviation, min, max and quartiles can easily be obtained for each variable.
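For example:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for every variable
df.describe()
```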
From the mean of the churn variable we can infer that, on average, about 14.4% of the customers churn.
Next, we can look at churn with respect to another variable. For example, do people with an international plan (IntlPlan) have a higher average churn rate than people without one? To get these insights we will use two functions. The first is crosstab in pandas, which counts the number of instances or samples based on two dimensions. In this case we want to see how many people have an international plan and how many of them churn.
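A sketch of the crosstab call:

```python
# How many customers have an international plan, and how many of them churn
pd.crosstab(df['IntlPlan'], df['Churn'])
```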
More useful information can be obtained using the groupby function on the pandas DataFrame. This function groups the data by one feature and lets us compute statistics on the other features.
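For example:

```python
# Average churn rate for customers with and without an international plan
df.groupby('IntlPlan')['Churn'].mean()
```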
We can immediately see there is a clear difference in the average churn rate between people with and without an international plan (42.4% versus 11.5%).
A more structured way to do the above operation is with a pivot table. We can create the same type of overview using the pandas function pivot_table. You need to specify which values you want to calculate your functions on and which dimensions to split the data on. To get some more insight we will now create a pivot table for the two features IntlPlan and VMPlan.
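A sketch of that pivot table (the VMPlan column name follows the renaming assumed earlier):

```python
# Average churn rate split by international plan and voicemail plan
pd.pivot_table(df, values='Churn', index='IntlPlan',
               columns='VMPlan', aggfunc='mean')
```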
It would not be a great data analysis if we missed out on a treat for our eyes through data visualization. Matplotlib and Seaborn are the two colorful stars of our show.
With the pivot_table and groupby functions we can get the numerical information; if we want that information explained visually, we can use the data visualization libraries mentioned above.
It makes sense to look at how the distribution of the data differs for the people that churn and the people that don’t churn. We can do this in two ways.
Firstly, if a variable has a small number of different values/categories we could create a similar bar chart (factorplot) for each of those values, representing the average churn rate for that specific value. We'll try this for the CustServCalls variable, which ranges from 0 to 9.
The above information can be visually described using the barplot function.
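A minimal sketch using seaborn's barplot (since Churn is coded as 0/1, the bar height is the average churn rate for each number of calls):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Average churn rate per number of customer service calls
sns.barplot(x='CustServCalls', y='Churn', data=df)
plt.ylabel('Average churn rate')
plt.show()
```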
This method seems to work well when a feature may have a limited range of values and when those values are also quite discrete. When dealing with larger ranges and more continuous features these graphs will quickly become less informative. This is where it makes more sense to look at the distribution or kernel-density of a variable. For these visualisations we will use the violinplot from the seaborn library.
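For example, for the day-time minutes (column name assumed from the renaming above):

```python
# Distribution of day-time minutes for churners versus non-churners
sns.violinplot(x='Churn', y='DayMins', data=df)
plt.show()
```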
From these graphs the most interesting ones may be DayMins and DayCharge. Obviously these two will be related, since the charge is most likely determined by the minutes used and some tariff. From the above graphs it is interesting to note that they seem to have a bimodal distribution with a higher maximum. This may indicate that very active users are more likely to churn, perhaps because they are always looking for a better package/deal, or because they are more affected by poor service that does not meet their expectations.
Preprocessing and Feature Engineering:
One of the most important stages in data analysis is probably the pre-processing and feature engineering stage. At this stage you will consider how you can use existing information to extract important features for prediction. This is often considered more of an art than a science and it usually requires some good understanding of the data, the domain/industry and the problem:
Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng
Pre-processing essentially focuses on two problems: (1) data quality and (2) data representation. The first focuses on the quality of your dataset and potential problems with errors in the data itself. This often results in discarding data or adding values based on imputation or interpolation. The second issue deals with the question of how you transform your data such that it will work best with your algorithms.
When looking at the other features, we have three different features for the usage per part of the day (day, eve, night). We know how many minutes people have used their phone, how many calls they have made and how much they were charged for that activity. The interesting part is that two of those variables seem to depend on another one of the variables.
- You can argue that the number of minutes is most likely related to the number of calls made. Calls say something about activity, but the minutes per call also give an indication of activity. So I decided to create a new variable that holds the average minutes per call.
- The charge is most likely a function of the number of minutes and some tariff applied to them. So it makes sense to create a feature that represents the tariff people were charged; see the sketch after this list.
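A sketch of both features, assuming the per-period columns are named DayMins/DayCalls/DayCharge and so on after the renaming:

```python
# Average minutes per call and implied tariff per minute for each part of the day
for part in ['Day', 'Eve', 'Night']:
    df[part + 'MinsPerCall'] = df[part + 'Mins'] / df[part + 'Calls']
    df[part + 'Tariff'] = df[part + 'Charge'] / df[part + 'Mins']
```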
This code will nicely generate the new features, but it will also insert NaN’s due to dividing by zero. To deal with this we have to make sure to set all of the NaN values to zero.
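One way to do that (this also catches any infinite values that appear when a non-zero value is divided by zero):

```python
# Replace NaN and infinite values created by the zero divisions with 0
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)
```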
We have an existing variable that holds the number of voicemail messages. Since it contains the value 0 for some samples, it can be hard for some algorithms to assign any importance to it, as they would be multiplying by 0. It can therefore be wise to create a new variable that tells us whether the number of voicemail messages is equal to zero. We'll label this variable NoVMMessages and it will have a value of 1 if there are no voicemail messages and 0 otherwise.
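A sketch, assuming the voicemail count column is called VMailMessage after the renaming:

```python
# Flag customers that have no voicemail messages at all
df['NoVMMessages'] = np.where(df['VMailMessage'] == 0, 1, 0)
```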
Because we have the latitude and longitude for each state now we can also drop the column that has the state code.
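Assuming the column is simply called State, that is a one-liner:

```python
# Drop the state code now that latitude/longitude carry the location information
df = df.drop('State', axis=1)
```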
We now have all the variables we need to work with. To make it easier to use in some models later on, we'll now convert the pandas DataFrame to a NumPy matrix. We'll create a vector y holding the churn labels and a matrix X with all the explanatory variables.
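For example:

```python
# Target vector y with the churn labels and matrix X with the explanatory variables
y = df['Churn'].values
X = df.drop('Churn', axis=1).values
```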
This gives us a matrix to work with. The big problem now is that a lot of the values are in completely different ranges. Remember that features like NoVMMessages are 0 or 1, whereas a feature like IntlMins ranges from 0 to a value in the hundreds. Such differences in ranges between features may cause problems in some models. Models that depend on calculating some distance between samples can be thrown off by the scaling of the individual features. To account for this we are going to rescale all features and standardize them. This boils down to subtracting their individual means and dividing by the individual standard deviations.
For this we can use the Scikit-Learn library and its StandardScaler. The documentation for StandardScaler reiterates the previous point, stating that:
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
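A minimal sketch of the scaling step:

```python
from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
```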
Cross-Validation
Before trying to predict customer churn using any model, we need to give our cross-validation framework some thought. Cross-validation is defined by Wikipedia as
a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
It serves the purpose of testing how well the model performs on data it has not seen before. It is a great way to get a feel for (a) how much overfitting is going on in the training of the model and (b) how robust the model is when facing uncertainty in terms of new data it encounters.
One way to approach cross-validation is by cutting the dataset up into a number of chunks (also known as folds). You then take the first chunk and keep it on the side. You take all the remaining data and train your model using that data. You then use your trained model to predict the chunk that you initially kept out on the side. These predictions will give you a feel for how well your model can deal with that chunk of data it has not seen before. Now repeat this process for each chunk that you have available, and at the end you will have predictions for each datapoint in your dataset.
There are a number of different techniques for cross-validation and each problem will probably justify its own cross-validation procedure. To not overcomplicate things we’ll stick with a very basic approach here that splits the dataset up in a number of blocks. Scikit-Learn has a great toolkit for cross-validation that we can draw from.
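A minimal sketch of such a framework, assuming Scikit-Learn's StratifiedKFold and a classifier that implements predict_proba (the function name and fold count below are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate_predict(X, y, clf, n_splits=5):
    """Train on all but one fold, predict the held-out fold, and repeat
    until every sample has an out-of-fold label and churn probability."""
    y_pred = np.zeros(len(y))
    y_prob = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = clf.predict(X[test_idx])
        y_prob[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    return y_pred, y_prob
```

Given a classifier such as Scikit-Learn's RandomForestClassifier, this returns a predicted label and a churn probability for every customer in the dataset.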
We now have our cross-validation framework. This will, once given the data and a classifier, predict for each user if they are going to churn and the probability that classification is based on.