Moving averages are particularly useful in finance and business domain where time series data is used to calculate important insights. We will attempt to understand usage of Pandas rolling
function (pandas.DataFrame.rolling
) by using it on an IPL batsman’s performance over the course of 10 years of international cricket.
Given below are the batting performances of two Indian batsman MS Dhoni and SR Tendulkar over 10 years of their career on some selected matches against popular teams. Asterisk (*) appended with runs shows the batsman was not out at final status of the match.
In cricket, a batsman’s ability to score runs in a particular match has a significant correlation to runs scored by him over past few months. This is considered as ‘form’ of the batsman. A batsman in good form is preferred over others when the 11 players are decided for a match.
The most important statistic used to evaluate a batsman’s performance is their batting average over a certain period of time. Formula for batting average is:
Batting Average = Runs Scored / Number of times dismissed (Out)
We are interested to calculate “Form Average” of batsman which is nothing but the mean of runs scored by the batsman over the past 1 year from the date of the match. The formula has two parts that need separate calculation. Let’s calculate rolling sum of the runs and number of dismissals of the two batsman over the past 1 year from the date of the match.
Syntax of rolling function:
DataFrame.rolling(window, win_type=None, on=None, closed=None)
We are interested in window size with an offset. As given in official documentation of pandas, each window that is declared as offset will be variable sized based on the observations included within the time-period defined by offset. We will see this in an example.
The offset is declared as a string like ‘365D’ where D represents days. Pandas uses datetime information if available in the index or as defined by the ‘on’ parameter.
Before we are allowed to proceed, we must prepare the dataframe. We will create a new column to capture Number of times dismissed (Out) from Runs column. In fourth line we remove asterisk (*) to maintain consistency with calculations.
Before creating a new column, we will create two Series objects pandas.core.series.Series
to store rolling sum corresponding to player name and number of dismissals in last 365 days.
This gives us two series:
Before we can store this series object into the dataframe we must set Name and Date as dataframe indexes to make it compatible with the order of the two series. Then we can divide roll_outs
with roll_runs
to get a rolling average of 365 days.
So the final df object looks like so:
We can observe few values as inf as the number of dismissals equal to zero in these cases. This is the fault with the way batting average is calculated which is undefined if the batsman has not been dismissed for the duration of matches average is being calculated.
If we wish to avoid such values, we can replace number of dismissals by 1 when a value 0 is found like so: