In the last post I described the predictive models that this series will cover. Following the development process for predictive models, the next step is to supply the raw data for these models. Fortunately, football-data.co.uk already offers all the data they need. So this post explains how to implement the features for the GS and PPG match rating models based on the existing Raw Data Vault model.
GS match rating features
Calculating the GS rating for a specific match consists of 3 steps. In the first 2 steps you calculate the goal difference over the last games of the home team and of the away team. The goal difference of the home team is based only on its last home games, the goal difference of the away team only on its last away games. This way the home advantage is taken into account. In the third step, you subtract the away team's goal difference from the home team's goal difference to get the GS match rating.
Applying this calculation to the match of Freiburg against Dortmund (09.09.2017) looks like this: Freiburg scored 9 and conceded 8 goals in the last 7 home games, which results in a goal difference of +1. Dortmund scored 13 and conceded 10 goals in the last 7 away games, which results in a goal difference of +3. The GS rating for this match is therefore (+1) - (+3) = -2.
How many games you include in the goal difference is a parameter you have to choose. A smaller number puts more weight on the current form. A larger number reflects the general team strength more closely. To start with, I have chosen to use the last 7 home and away games.
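The three steps can be sketched in a few lines of Python. This is only an illustration, not the feature implementation of this series: the function names and the `window` parameter are my own, and the per-match scores are invented so that they add up to the totals of the Freiburg and Dortmund example (9:8 and 13:10).

```python
from typing import Sequence, Tuple

# One match from the team's point of view: (goals scored, goals conceded)
Match = Tuple[int, int]


def goal_difference(matches: Sequence[Match], window: int = 7) -> int:
    """Goal difference over the most recent `window` matches (oldest first)."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = matches[-window:]
    return sum(scored - conceded for scored, conceded in recent)


def gs_rating(home_matches: Sequence[Match],
              away_matches: Sequence[Match],
              window: int = 7) -> int:
    """GS rating = home team's home goal difference
    minus away team's away goal difference."""
    return (goal_difference(home_matches, window)
            - goal_difference(away_matches, window))


# Invented per-match scores, chosen only to reproduce the totals in the text.
freiburg_home = [(2, 1), (1, 1), (0, 2), (2, 1), (1, 0), (2, 2), (1, 1)]  # 9:8
dortmund_away = [(3, 0), (2, 1), (3, 2), (1, 1), (2, 2), (1, 2), (1, 2)]  # 13:10

print(gs_rating(freiburg_home, dortmund_away))  # -2
```

Changing `window` is all it takes to experiment with a shorter or longer form period.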
Based on this knowledge, you can calculate the GS match rating for every match available in the data set. If you want a rough idea of how likely it is that a match with a GS match rating of -2 ends with a home win, you can aggregate the historic results for each match rating. For the Bundesliga, the following graph shows the historic result percentages for different GS match ratings.
The graph shows two things. First, the higher the GS match rating, the more home wins happened, and vice versa. This supports the main assumption of this predictive model: the team that has been scoring more and conceding fewer goals is more likely to win a match. Second, you cannot simply use the historic percentage distribution to estimate the probabilities of future matches. For example, the percentage of home wins with a GS match rating of -9 is higher than with a GS match rating of -8. This is caused by the small number of past matches with such extreme match ratings. This is where linear regression comes into play. But that is not part of this post and will be discussed later.
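To make the aggregation behind such a graph concrete, here is a minimal Python sketch. It assumes you already have a list of (rating, result) pairs, with the result coded as H, D or A like the full-time result column of football-data.co.uk. The function name and the sample data are made up for illustration; they are not Bundesliga figures.

```python
from collections import Counter, defaultdict


def result_distribution(rated_matches):
    """Share of home wins (H), draws (D) and away wins (A) per match rating.

    `rated_matches` is an iterable of (rating, result) pairs.
    The number of matches `n` is returned as well, so that sparsely
    populated ratings can be spotted.
    """
    counts = defaultdict(Counter)
    for rating, result in rated_matches:
        counts[rating][result] += 1

    table = {}
    for rating in sorted(counts):
        n = sum(counts[rating].values())
        table[rating] = {
            "n": n,
            "H": counts[rating]["H"] / n,
            "D": counts[rating]["D"] / n,
            "A": counts[rating]["A"] / n,
        }
    return table


# Invented sample data, only to show the shape of the result.
sample = [(-2, "A"), (-2, "H"), (-2, "A"), (-2, "D"),
          (3, "H"), (3, "H"), (-9, "H")]

for rating, row in result_distribution(sample).items():
    print(rating, row)
```

In this toy sample the rating -9 shows 100 % home wins, but it is based on a single match. Keeping `n` next to the percentages makes this kind of small-sample noise visible.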
PPG match rating features
Calculating the PPG match rating follows exactly the same steps as the calculation of the GS match rating, but uses the points gained per match instead of the goal difference. For our example this looks like this: Freiburg gained 12 points in the last 7 home games, which is 1.71 points per game on average. Dortmund gained 11 points in the last 7 away games, which equates to an average of 1.57 points. This results in a PPG match rating of 1.71 - 1.57 = 0.14.
The example nicely shows the difference between both match ratings. The GS match rating indicates Dortmund as the favoured team, because Dortmund has the better goal difference. But this better goal difference did not help Dortmund to gain more points. Instead, Freiburg gained 1 point more despite the worse goal difference. So the PPG match rating indicates Freiburg as the favoured team.
Looking at the percentages of past results for different PPG match ratings produces the following graph:
The graph shows a similar distribution to the graph for the GS match rating. So both ratings support the main assumption: a team that gained more points per game on average over the last games is more likely to win a match. Because both graphs show a similar percentage distribution, the accuracy of both prediction models should be similar, and differences should only show up in single matches, as in the example above.
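The PPG calculation can be sketched the same way. Again, this is only an illustration with hypothetical function names, and the per-match scores are invented so that they reproduce the 12 and 11 points of the example.

```python
def points(scored: int, conceded: int) -> int:
    """3 points for a win, 1 for a draw, 0 for a defeat."""
    if scored > conceded:
        return 3
    if scored == conceded:
        return 1
    return 0


def points_per_game(matches, window: int = 7) -> float:
    """Average points over the most recent `window` matches (oldest first)."""
    recent = matches[-window:]
    return sum(points(s, c) for s, c in recent) / len(recent)


def ppg_rating(home_matches, away_matches, window: int = 7) -> float:
    """PPG rating = home team's home points per game
    minus away team's away points per game."""
    return (points_per_game(home_matches, window)
            - points_per_game(away_matches, window))


# Invented per-match scores (scored, conceded), matching the totals in the text.
freiburg_home = [(2, 1), (1, 1), (0, 2), (2, 1), (1, 0), (2, 2), (1, 1)]  # 12 points
dortmund_away = [(3, 0), (2, 1), (3, 2), (1, 1), (2, 2), (1, 2), (1, 2)]  # 11 points

print(round(points_per_game(freiburg_home), 2))            # 1.71
print(round(points_per_game(dortmund_away), 2))            # 1.57
print(round(ppg_rating(freiburg_home, dortmund_away), 2))  # 0.14
```

With this invented data, the same two match lists give a GS rating of -2 (9:8 against 13:10) and a positive PPG rating, which is exactly the disagreement between the two ratings described above.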
If you already implemented the features for my Poisson model, you should have an idea of how to calculate these features for a home and an away team. The SQL code for the calculation of the GS & PPG match rating for all historic matches looks like this:
Like every feature satellite I have described in my posts, this one also integrates into the existing Raw Data Vault model. The satellite uses the last 7 home games of the home team and the last 7 away games of the away team to determine the goal difference and the gained points for each match in the past. Which number of games works best for the GS & PPG rating models will be looked at in another post. Two separate sub-selects (home: lines 44-80; away: lines 85-121) fulfil this task. An analytical function is used to determine (lines 59,101) and filter (lines 74,116) the number of games. These results get combined with the historic matches (line 39). The last step consists of calculating the GS (line 32) & PPG (line 36) match rating for each match.
The next post will describe how you can use these historic features and linear regression to determine the fair odds for every match. I will also explain how to optimise a linear regression model with R.
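If you want to check the logic outside the database, the idea behind the two sub-selects can be sketched in plain Python. This is not a translation of the SQL satellite. It rests on assumptions of mine: the matches are passed in chronological order, the current match is excluded from its own features, and a match only gets ratings once both teams have a full window of games. The function names are hypothetical.

```python
from collections import defaultdict, deque


def points(scored, conceded):
    """3 points for a win, 1 for a draw, 0 for a defeat."""
    return 3 if scored > conceded else 1 if scored == conceded else 0


def rating_features(matches, window=7):
    """GS and PPG rating for each match.

    `matches` must be in chronological order; each entry is
    (home_team, away_team, home_goals, away_goals).
    """
    # Per team: the last `window` home games and the last `window` away games.
    home_hist = defaultdict(lambda: deque(maxlen=window))
    away_hist = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for home, away, hg, ag in matches:
        h, a = home_hist[home], away_hist[away]
        if len(h) == window and len(a) == window:
            gs = sum(s - c for s, c, _ in h) - sum(s - c for s, c, _ in a)
            ppg = (sum(p for _, _, p in h) / window
                   - sum(p for _, _, p in a) / window)
        else:
            gs = ppg = None  # assumption: no rating without a full window
        rows.append((home, away, gs, ppg))
        # Update the histories only after the features have been computed,
        # so a match never contributes to its own rating.
        h.append((hg, ag, points(hg, ag)))
        a.append((ag, hg, points(ag, hg)))
    return rows


# Tiny invented example with window=2 to keep it readable.
demo = [("A", "B", 2, 0), ("A", "C", 1, 1), ("D", "B", 0, 1), ("A", "B", 3, 1)]
for row in rating_features(demo, window=2):
    print(row)
```

Only the last match of the demo has enough history on both sides: team A brings a goal difference of +2 and 2.0 points per home game, team B a goal difference of -1 and 1.5 points per away game, so the result is a GS rating of 3 and a PPG rating of 0.5.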
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.