Wednesday, 3 August 2016

Keep Calm n Let's Do Baseball Analytics - Sabermetrics - Python

         When it comes to data analytics, I keep looking for a domain where it is not applied rather than one where it is. My search feels like a never-ending while-loop, because the exit condition, finding a domain where analytics plays no role, is nowhere in sight. So imagine my astonishment when one of my professors talked about the application of analytics in baseball. Bewildered, I murmured so softly that even I could barely hear it: "Is he kidding?" It's a game played with a bat and a ball, a player's performance depends on his ability, and surely there is no way to predict which player is gonna rock the field and when.
          Unconvinced but curious, I looked into it and found that it is possible. Awestruck, I went through the blog post my tech-savvy professor had shared, on how analytics cracks the mystery of player performance using machine learning concepts and the powerful programming tool "PYTHON" (and yeah, that's the favourite pet in the world of analytics at present).
          Before jumping into baseball analytics, I want to talk about something called SABERMETRICS (I hope you are as terrified by the word as I was). It's nothing but a fashionable way of saying baseball analytics (yeah, people always love to hear complex things). To throw some light on our star "Sabermetrics", let me get some help from the guy who knows almost everything (oops, I am not a male chauvinist, please excuse me), the great WIKIPEDIA, and here is how he defines it:

      Sabermetrics is the empirical analysis of baseball, especially baseball statistics that measure in-game activity. It tries to answer almost every question pertaining to baseball, from roughly how many runs a team can score to which player a team should pick next for a better chance of getting the desired result.

    Google is a far better place to learn more about Sabermetrics (let me not let myself down: I have given some introduction and am going to do a step-by-step analysis using Python). Here we are going to discuss the famous sabermetrics case of the Oakland Athletics, whose general manager Billy Beane followed the analytical strategies designed by mastermind Paul DePodesta. To simplify things, Paul was the guy who used sabermetric techniques to analyse players and work out how the team had to perform to get the desired result.

Let's dive into the coding... It's time for action... Let the game begin..

1. Data Intro:
       This dataset is a collection of batting and pitching statistics from 1871 to 2013 (Source: Lahman Baseball). We will play with the data using Python 2.7, and the libraries that will help us in the analysis are NumPy, SciPy, Pandas, Matplotlib and Statsmodels. Let us take a small peek at each library:

NumPy - NumPy is an open source extension module for Python. It provides fast precompiled functions for numerical routines and adds support for large, multi-dimensional arrays and matrices. Besides that, it supplies a large library of high-level mathematical functions to operate on these arrays.

SciPy - SciPy is widely used in scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

Pandas - Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.

Matplotlib - Matplotlib is a plotting library for Python and its numerical extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+.

Statsmodels - Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

2. Getting the data into Jupyter Notebook:
         We first import all the libraries in a single cell so we can use them whenever we like (because, obviously, I am a bit lazy; hope no one heard).
We then load the Teams.csv file with the Pandas library and start our analysis. The dataset contains 2745 observations and 48 attributes.
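Here is a minimal sketch of that loading step, written for a current Python 3 and pandas. The helper name load_teams and the three sample rows are my own made-up stand-ins so the snippet runs without the real file; only the column names follow the Lahman style.

```python
import io

import pandas as pd

# Hypothetical stand-in for Teams.csv: three made-up rows, not real statistics.
SAMPLE_TEAMS_CSV = """yearID,teamID,W,R,RA
2000,AAA,10,50,40
2001,AAA,12,55,38
2001,BBB,9,45,47
"""


def load_teams(source):
    """Read a Teams-style CSV from a file path or a file-like object."""
    return pd.read_csv(source)


teams = load_teams(io.StringIO(SAMPLE_TEAMS_CSV))
print(teams.shape)   # (number of rows, number of columns)
print(teams.head())  # first few observations
```

With the real data, load_teams("Teams.csv") followed by teams.shape should report the 2745 observations and 48 attributes mentioned above.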



















3. Let's Analyze Precisely, Yeah, a Subset:
       After going through all the attributes (impatiently) and settling on the 15 important ones that add value to the model, we make a subset of the dataset. And yeah, it contains only the data from 1985 onwards.
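A rough sketch of the subsetting step could look like this. The short column list is only an assumption for illustration (the real shortlist has 15 attributes), I am also assuming that 1985 itself is included, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the real Teams table has many more columns.
teams = pd.DataFrame({
    "yearID": [1984, 1985, 2001, 2001],
    "teamID": ["AAA", "AAA", "AAA", "BBB"],
    "W": [8, 9, 12, 10],
    "R": [40, 44, 55, 50],
    "park": ["x", "x", "y", "z"],  # stands in for a column we drop
})

# Assumed shortlist of columns to keep (hypothetical, not the post's full 15).
keep_cols = ["yearID", "teamID", "W", "R"]

# Keep seasons from 1985 onwards (assumption: 1985 is included).
teams_sub = teams.loc[teams["yearID"] >= 1985, keep_cols].copy()
print(teams_sub)
```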




The dataframe is then indexed by yearID and teamID.
It's just a step for easier access to the dataframe.
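As a small sketch of why this index helps (values made up, column names in Lahman style), a two-level index lets us pull out one team-season with a single lookup:

```python
import pandas as pd

teams = pd.DataFrame({
    "yearID": [2001, 2001, 2002],
    "teamID": ["OAK", "NYA", "OAK"],
    "W": [10, 9, 11],  # made-up win totals
})

# Index by (year, team) and sort so lookups stay fast and tidy.
teams = teams.set_index(["yearID", "teamID"]).sort_index()

# One team-season is now a single .loc lookup with a (year, team) tuple.
oak_2001 = teams.loc[(2001, "OAK")]
print(oak_2001["W"])
```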








The salary.csv file is imported into the Jupyter notebook using the Pandas library.
This dataframe contains the salaries of players from 1985 to 2013.
Using this dataset, we calculate the payroll.
The payroll of a particular team in a particular year is the sum of its players' salaries for that year.
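One way to sketch the payroll calculation is a groupby over year and team. The salary figures and player IDs below are made up, and the column names are assumed to follow the Lahman style.

```python
import pandas as pd

# Made-up salary rows, one per player per season.
salaries = pd.DataFrame({
    "yearID": [2001, 2001, 2001, 2002],
    "teamID": ["OAK", "OAK", "NYA", "OAK"],
    "playerID": ["p1", "p2", "p3", "p1"],
    "salary": [100, 250, 900, 120],
})

# Payroll = sum of player salaries per (year, team).
payroll = salaries.groupby(["yearID", "teamID"])["salary"].sum()

print(payroll.loc[(2001, "OAK")])  # 100 + 250
```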







The next step merges the salary data with the team dataset to get an overall view of the much-needed information. The payroll is now stored in a column called salary, so we can check the payroll of the Oakland Athletics in 2001.
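A minimal sketch of the merge, assuming both tables share the (yearID, teamID) index; the numbers are made up, and a join on the index is just one of several ways pandas can do this.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(2001, "NYA"), (2001, "OAK")], names=["yearID", "teamID"]
)

teams = pd.DataFrame({"W": [9, 10]}, index=index)       # made-up wins
payroll = pd.Series([900, 350], index=index, name="salary")  # made-up payrolls

# Join on the shared index; the Series name becomes the new column name.
teams = teams.join(payroll)

oak_payroll = teams.loc[(2001, "OAK"), "salary"]
print(oak_payroll)
```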




Let's get ourselves busy generating some plots, starting with one that shows the relationship between salaries and the number of wins for the year 2001. The colourful Matplotlib comes in handy for such plots.


















The following two functions plot the relationship between salaries and wins, with labels and axis formatting, and highlight the data for the Oakland Athletics, the New York Yankees, and the Boston Red Sox.












We then run the plot function we built, passing two arguments, the teams and the year, to find the relationship between salaries and the number of wins. This might help gauge how much money needs to be invested in players to get the desired result.
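The plotting step could be sketched roughly as below. This is a single simplified function rather than the two described above; the name plot_salary_vs_wins, the colours, the team code "XXX" and all the numbers are my own placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up payroll and win totals for one season, indexed by team code.
season = pd.DataFrame(
    {"salary": [34, 110, 112, 60], "W": [10, 9, 8, 7]},
    index=["OAK", "NYA", "BOS", "XXX"],
)

HIGHLIGHT = {"OAK": "green", "NYA": "black", "BOS": "red"}  # assumed colours


def plot_salary_vs_wins(season, year, highlight=HIGHLIGHT):
    """Scatter payroll against wins and label the highlighted teams."""
    fig, ax = plt.subplots()
    ax.scatter(season["salary"], season["W"], color="lightgrey")
    for team, colour in highlight.items():
        if team in season.index:
            row = season.loc[team]
            ax.scatter(row["salary"], row["W"], color=colour)
            ax.annotate(team, (row["salary"], row["W"]),
                        xytext=(5, 5), textcoords="offset points")
    ax.set_xlabel("Payroll")
    ax.set_ylabel("Wins")
    ax.set_title("Payroll vs. wins, {}".format(year))
    return ax


ax = plot_salary_vs_wins(season, 2001)
```

In a notebook the figure shows up on its own; in a plain script you would call plt.show() afterwards.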

















From the plot we get an interesting result: the Oakland Athletics are registering more wins with lower-paid players and outperforming many teams that invest heavily in their players. Interesting!!!!

4. Billy Beane's Formula

         For a baseball team to win a game, it needs to score more runs than it allows. In the remainder of this tutorial, we will build a mathematical model for runs scored. Similar logic could be applied to model runs allowed.
Most teams focused on Batting Average (BA) as the statistic to improve their runs scored. Billy Beane took a different approach: he focused on improving On-base Percentage (OBP) and Slugging Percentage (SLG).
The Batting Average is the number of hits divided by at bats. It can be calculated using the formula below:
BA = H/AB
On-base Percentage is a measure of how often a batter reaches base for any reason other than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference. It can be calculated using the formula below:
OBP = (H+BB+HBP)/(AB+BB+HBP+SF)
Slugging Percentage is a measure of the power of a hitter. It can be calculated using the formula below:
SLG = (H+2B+(2*3B)+(3*HR))/AB
(referred from Billy Beane's Formula)
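To make the three formulas concrete, here is a small worked sketch on a single made-up team-season. The column names (H, AB, BB, HBP, SF, 2B, 3B, HR) are assumed to follow the Lahman style.

```python
import pandas as pd

# One made-up team-season.
teams = pd.DataFrame({
    "H": [150], "AB": [500], "BB": [60], "HBP": [5], "SF": [5],
    "2B": [30], "3B": [5], "HR": [20],
})

# BA = H / AB
teams["BA"] = teams["H"] / teams["AB"]

# OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
teams["OBP"] = (teams["H"] + teams["BB"] + teams["HBP"]) / (
    teams["AB"] + teams["BB"] + teams["HBP"] + teams["SF"]
)

# SLG = (H + 2B + 2*3B + 3*HR) / AB, i.e. total bases per at bat
teams["SLG"] = (
    teams["H"] + teams["2B"] + 2 * teams["3B"] + 3 * teams["HR"]
) / teams["AB"]

print(teams[["BA", "OBP", "SLG"]])
```

For this row BA is 150/500 = 0.300, OBP is 215/570 (about 0.377) and SLG is 250/500 = 0.500.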




          The above three variables are added to the dataframe, where they will be very helpful for our model. We then build a linear regression model to predict the runs scored by a team, which is one of the most important parameters in assessing team performance.
         Here we build 3 different models: the first with BA, OBP and SLG, the second with OBP and SLG, and the third with BA only. We use the exciting statsmodels library to build the regression models.
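The three regressions can be sketched with the statsmodels formula API as below. Since the real dataframe is not reproduced here, the data is synthetic: the coefficients and noise levels that generate it are invented purely so the snippet runs, and its output says nothing about real baseball.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic team-seasons, generated only so the sketch is runnable.
rng = np.random.RandomState(0)
n = 200
obp = rng.normal(0.33, 0.015, n)
slg = rng.normal(0.42, 0.03, n)
ba = 0.8 * obp + rng.normal(0, 0.005, n)  # BA made to correlate with OBP
runs = 2700 * obp + 1600 * slg - 800 + rng.normal(0, 20, n)
teams = pd.DataFrame({"R": runs, "BA": ba, "OBP": obp, "SLG": slg})

# Model 1: BA, OBP and SLG; model 2: OBP and SLG; model 3: BA only.
model_1 = smf.ols("R ~ OBP + SLG + BA", data=teams).fit()
model_2 = smf.ols("R ~ OBP + SLG", data=teams).fit()
model_3 = smf.ols("R ~ BA", data=teams).fit()

print(model_2.params)
```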










We can look at the summary statistics of these models by calling summary() on each fitted model.


























The first model has an Adjusted R-squared of 0.918, but the 95% confidence interval for its BA coefficient includes negative values.
This is counterintuitive, since we expect the BA coefficient to be positive. It is due to multicollinearity (a high correlation between attributes) among the variables.
The second model has an Adjusted R-squared of 0.919, and the last model an Adjusted R-squared of 0.500. The second model's AIC and BIC are also the lowest of the three, which gives a clear idea of which model gives the most accurate result.
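A side-by-side comparison like the one above can be sketched by collecting Adjusted R-squared, AIC and BIC from each fit, and the multicollinearity can be eyeballed with a correlation matrix. As before, the data here is synthetic with invented coefficients, so the printed numbers will not match the figures quoted in this post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data again, only to make the comparison runnable.
rng = np.random.RandomState(1)
n = 200
obp = rng.normal(0.33, 0.015, n)
slg = rng.normal(0.42, 0.03, n)
ba = 0.8 * obp + rng.normal(0, 0.005, n)
teams = pd.DataFrame({
    "R": 2700 * obp + 1600 * slg - 800 + rng.normal(0, 20, n),
    "BA": ba, "OBP": obp, "SLG": slg,
})

formulas = {
    "BA + OBP + SLG": "R ~ OBP + SLG + BA",
    "OBP + SLG": "R ~ OBP + SLG",
    "BA only": "R ~ BA",
}

# Collect the fit statistics for each model into one table.
rows = {}
for label, formula in formulas.items():
    fit = smf.ols(formula, data=teams).fit()
    rows[label] = {"adj_r2": fit.rsquared_adj, "aic": fit.aic, "bic": fit.bic}
comparison = pd.DataFrame(rows).T
print(comparison.round(3))

# High pairwise correlation between predictors hints at multicollinearity.
correlations = teams[["BA", "OBP", "SLG"]].corr()
print(correlations.round(2))
```

Lower AIC and BIC are better, so the row with the smallest values is the one to prefer, all else being equal.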

Based on this analysis, we can conclude that the second model, using OBP and SLG, is the best of the three for predicting runs scored.
      Given that, a good strategy for recruiting batters would be to target undervalued players with high OBP and SLG. In the late 1990s, the old-school scouts overvalued BA, and players with a high BA had high salaries. Although BA and OBP are positively correlated, there were some players who had high OBP and SLG and a relatively low BA. These players were undervalued by the market, and they were Billy Beane's targets.
     

Reference:

http://adilmoujahid.com/posts/2014/07/baseball-analytics/
http://www.datasciencecentral.com/profiles/blogs/9-python-analytics-libraries-1