This post starts a new series in which I explain how to implement another predictive model on the TripleA DWH architecture. When I started developing predictive models with R, I was a little overwhelmed by the many different plots R provides for analysing and optimising a predictive model. That’s why I wanted to learn and understand the whole optimisation process in R using a simple predictive model. Football-data.co.uk provides an explanation of a small rating system that uses a linear regression to predict the probability of a home win, draw or away win. I chose this model because linear regression is a frequently used and easy-to-understand predictive method. With a linear regression you can investigate the relationship between the variable to be predicted and one or more features.
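To make the idea of a linear regression a bit more concrete, here is a minimal sketch of a least-squares fit with a single feature. It is written in Python purely for illustration (the series itself works in R), and the numbers are made up: the feature is a match rating and the target is the share of home wins observed at that rating. The function names `fit_line` and `predict` are my own and not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical illustration data, NOT real Bundesliga results:
# a match rating and the share of home wins observed for that rating.
ratings = [-10, -5, 0, 5, 10]
home_win_share = [0.25, 0.35, 0.45, 0.55, 0.65]

intercept, slope = fit_line(ratings, home_win_share)


def predict(rating):
    """Estimated home-win probability for a given match rating."""
    return intercept + slope * rating


print(f"P(home win) = {intercept:.2f} + {slope:.3f} * rating")
```

The fitted line can then be used to translate the rating of an upcoming match into an estimated probability, for example `predict(3)`. A straight line can leave the range 0 to 1 for extreme ratings, which is one of the things worth checking when the model is optimised.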
The predictive model provided by football-data.co.uk is based solely on the number of goals scored and conceded by a team. It is called the Goal Superiority Rating System. The basic assumption of this model is that a team which scores more and concedes fewer goals than another team is more likely to win a match between the two. Based on how many goals the two teams of a specific match have scored and conceded, a GS (goal superiority) match rating is calculated. In most cases the GS match rating is a small positive or negative number, as in most matches the teams are more or less even. If one team is far superior to the other, the GS match rating will be a large positive or negative number. The following picture shows the GS match ratings for the German Bundesliga for the seasons 2011/12 to 2016/17:
As expected, most matches have a small negative or positive GS match rating. Matches in which one team is far superior to the other are much less common. If you compare my distribution to the one shown in the football-data.co.uk paper, you can see that the peaks of the distributions differ a bit. The peak of my distribution is not located at zero. This is because I used only the most recent home matches of the home team and only the most recent away matches of the away team to calculate the goal differences. By doing this, I want to take a team's home advantage into account.
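The calculation described above can be sketched roughly as follows. This is a Python illustration under a few assumptions of my own, not the final R implementation of this series: the match rating is taken as the home team's goal superiority minus the away team's goal superiority, the window of 6 matches is an assumed default (the number of matches is a free parameter here), and the teams and results are invented.

```python
from typing import NamedTuple


class Match(NamedTuple):
    home: str
    away: str
    home_goals: int
    away_goals: int


def goal_superiority(history, team, venue, window=6):
    """Goals scored minus goals conceded over the team's last `window`
    matches at the given venue ('home' or 'away').

    `history` must be in chronological order, oldest match first.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if venue == "home":
        diffs = [m.home_goals - m.away_goals for m in history if m.home == team]
    else:
        diffs = [m.away_goals - m.home_goals for m in history if m.away == team]
    return sum(diffs[-window:])


def gs_match_rating(history, home_team, away_team, window=6):
    """Home team's home-only GS minus away team's away-only GS (assumed form)."""
    return (goal_superiority(history, home_team, "home", window)
            - goal_superiority(history, away_team, "away", window))


# Hypothetical teams and results, for illustration only.
history = [
    Match("A", "B", 2, 0),
    Match("C", "A", 1, 1),
    Match("B", "C", 0, 3),
    Match("A", "C", 3, 1),
    Match("C", "B", 2, 2),
    Match("B", "A", 1, 2),
]

print(gs_match_rating(history, "A", "B"))
```

Because the home team's rating is built from home matches only and the away team's rating from away matches only, the home advantage is already contained in the rating, which is why the peak of the distribution moves away from zero.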
But is the team with the better goal difference really more likely to win a match? If this were the case, then the higher a team's goal difference, the better its table position should be. The Bundesliga season 2015/16 is a good example of such a situation:
The last-placed team finished with a goal difference of -31. From there, the goal difference gets better and better with each table position. But there are also years in which teams with a bad goal difference reach a table position that could not have been expected. The Bundesliga season 2016/17 is a good example:
Freiburg finished last season in 7th place in spite of their bad goal difference of -18. Rank 14 or 15 would have been much more realistic with such a goal difference. Hamburg is another example. With a goal difference of -28 they should have been relegated. But once again they managed to stay in the first division.
So what could be an alternative to the GS match rating? What could be more meaningful than a team's goal difference? One candidate is the average number of points a team gains per game. The average points gained could explain how a team like Freiburg, with a goal difference of -18, achieved a final rank of 7. As with the GS match rating, the average points per game can be used to calculate a rating for each match. The following picture shows the PPG (points per game) match ratings for the German Bundesliga for the seasons 2011/12 to 2016/17:
The PPG match rating distribution looks very similar to the GS match rating distribution. The peak of the distribution is located on the home side, which reflects the home advantage. And most matches are tight matches between evenly matched teams. So both match ratings should be comparable in their ability to predict the outcome of a football match.
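A PPG match rating can be sketched in the same way as the GS match rating. Again this is only a Python illustration under my own assumptions: the rating is the home team's average points per home game minus the away team's average points per away game, the window of 6 matches is an assumed default, a team without any matching history gets 0.0, and the results are invented. Points follow the usual 3/1/0 scheme for win, draw and loss.

```python
def points(goals_for, goals_against):
    """League points for one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def points_per_game(history, team, venue, window=6):
    """Average points over the team's last `window` matches at the venue.

    `history` holds (home, away, home_goals, away_goals) tuples in
    chronological order, oldest match first.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if venue == "home":
        pts = [points(hg, ag) for h, a, hg, ag in history if h == team]
    else:
        pts = [points(ag, hg) for h, a, hg, ag in history if a == team]
    recent = pts[-window:]
    # Assumption: a team without any matches at this venue is rated 0.0.
    return sum(recent) / len(recent) if recent else 0.0


def ppg_match_rating(history, home_team, away_team, window=6):
    """Home team's home-only PPG minus away team's away-only PPG (assumed form)."""
    return (points_per_game(history, home_team, "home", window)
            - points_per_game(history, away_team, "away", window))


# Hypothetical teams and results, for illustration only.
history = [
    ("A", "B", 2, 0),
    ("C", "A", 1, 1),
    ("B", "C", 0, 3),
    ("A", "C", 3, 1),
    ("C", "B", 2, 2),
    ("B", "A", 1, 2),
]

print(ppg_match_rating(history, "A", "B"))
```

Unlike the goal-based rating, this one is bounded: with 3 points for a win it can only take values between -3 and 3, no matter how one-sided the individual results were.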
In the next posts of this series, I will explain how to build the predictive models for both match ratings. The general goals of this series are:
- Feature calculation for GS & PPG match rating
- Optimisation of a linear regression prediction problem
- Comparison of GS & PPG accuracy
- Check ability to beat the bookie
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.