In the first part of this data journey, I took a look at the general definition of expected goals (xG) and the usage of this metric. In the next step in the process of testing the predictive power of xG, I need to get some data. This part will focus on getting the team expected goals statistics. In one of the following parts, I will also take a look on getting the player expected goals statistics as this of course offers even deeper insights.
In the last post I described the predictive models, which will be explained in this series. Following the development process for predictive models, the next steps should handle the raw data supply for the predictive models. Fortunately football-data.co.uk already offers all data, which is needed for these models. So this post will explain, how you implement the features for the GS and PPG match rating models based on the existing Raw Data Vault model.
While browsing the internet and looking for some new inspiration to build an own predictive model, I came upon a very interesting possible feature: the Brier score.
The Brier score is a possibility to measure the accuracy of a predictive model. It gets often used to measure the accuracy for weather forecasts. First I thought, I could use it as a kind of calibration feature for a predictive model. So that a predictive model recognizes, when it was too inaccurate in the past. But using it as a feature to detect teams, which can be predicted well by the bookies or which could cause unexpected results, seems to be a more promising approach. Therefor I want to explain in this post, how to calculate the Brier score based on the last betting odds for a specific team.
In the last post the prototype of the Poisson prediction model has proven, that the optimised model is suitable to beat the bookie – at least for the German Bundesliga. The next step in the predictive model development process consists of implementing the model for forecasting the current fixtures. Regarding this model this part is very easy, as you need not to implement a trained model, just the prediction logic.
During my first investigations for predicting football scores I came across the predictive models of Maher  and Dixon / Coles . Maher modelled the number of goals a team scores during a match as two independent Poisson distributed variables, for the home team and the away team. He assumed that each team has an attacking strength and a defence strength. Dixon / Coles extended this model by adjusting some disadvantages of the Poisson distribution and by using a time dependent attack and defence strength. Both papers are the base of my first predictive model.
In this Post I want to describe, how the attack and defence strength are calculated and how you add this calculation to the existing Data Vault model. The predictive model itself will be explained in another post.