# Using xG & advanced stats to predict football matches

With the BeatTheBookieDataService in place, it's also time to provide some new models. This post takes a look at possible models using the team statistics provided for each match by understat.com. For this I will compare three of the most commonly used machine learning algorithms. Besides this, it's also time to test another basic question of predictive modeling for football: "To differ between home/away performance or not to differ?" For my Poisson models I always differentiated between home and away performance. But is this also needed when using ML algorithms?

## Understat team statistics

Understat.com provides different statistics on team and player level. In the last post I already compared the predictive performance of goals and xG. For this post I also incorporate the other team stats provided on the website, as they all carry some information about the strength of a team.

### Goals / xG

I have already written a lot about goals and xG. These are definitely the two most discussed statistics at the moment. Together they tell us how many goals a team actually scored and how many it should have scored.

### Shots / Shots on target

Both shot stats are useful to get more context about the quality of chances, e.g. if two teams have the same amount of goals or xG. Let's assume two teams both score on average 1 xG per match. One team takes 10 shots on average, the other one just 5. The second team is clearly able to create the more dangerous scoring opportunities, as it creates 0.2 xG per shot while the other team creates just 0.1 xG per shot.
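The arithmetic behind this comparison can be sketched in a couple of lines of Python (the numbers are the hypothetical ones from the example above):

```python
# xG per shot as a simple quality-of-chances indicator
# (numbers are the made-up ones from the example above)

def xg_per_shot(xg_total: float, shots: int) -> float:
    """Average xG created per shot taken."""
    return xg_total / shots

team_a = xg_per_shot(1.0, 10)  # 10 shots for 1.0 xG
team_b = xg_per_shot(1.0, 5)   # 5 shots for 1.0 xG

print(team_a)  # 0.1
print(team_b)  # 0.2
```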

### Passes allowed per defence action (PPDA)

PPDA is a great indicator to determine the pressing intensity of a team. The higher the pressing of a team, the lower the PPDA metric, as the opponent is not able to play many passes without getting pressed into defensive actions. In comparison, the PPDA value of a team is high when it plays a low block and does not actively pressure the opponent.
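A minimal sketch of the PPDA idea; the function and numbers are made up for illustration (the real metric counts the opponent's passes and the pressing team's defensive actions in defined pitch zones):

```python
# PPDA sketch: opponent passes divided by own defensive actions
# (defensive actions are typically tackles, interceptions, challenges
# and fouls during the opponent's build-up; all numbers are made up)

def ppda(opponent_passes: int, defensive_actions: int) -> float:
    return opponent_passes / defensive_actions

high_press = ppda(250, 50)  # many duels forced -> low PPDA
low_block = ppda(400, 20)   # opponent passes freely -> high PPDA

print(high_press)  # 5.0
print(low_block)   # 20.0
```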

### Deep

Deep (passes) measures the completed passes within roughly 20 yards of the opponent's goal, so that's every pass in and around the penalty area. This metric is a great indicator of how dominantly a team played. The closer successful passes are to the opponent's goal, the higher the probability that such a pass leads to a shot. Possession was never a good indicator of how dangerous a team is, but having possession in and around the 18-yard box is definitely dangerous.

## To differ between home/away or not to differ?

As in my last blog post, I don't just want to test a new model, I also want to test a basic assumption when building new models. This time I want to look at whether you should differentiate between home and away performance. But why should you? The most important point is, of course, home advantage. Everybody knows about home advantage in football: a team playing at home has a higher probability of winning. Another argument for differentiating between home and away performance is team tactics. Apart from top teams, which dominate just about every opponent in their domestic league, smaller teams may use different tactics for home and away games. In away games teams may play more passively with a lower pressing intensity.
That's why I will test two types of models in this blog post. One model uses all matches of a team to calculate the features. The other model uses only the past home matches to calculate the features for predicting home matches and only the past away matches to predict the away performance.

## Model list

Another factor you need to define for your model is the time span you want to include in the calculation and the weighting between the latest results and older matches. How these factors affect a model was part of the last model blog post. That's why I decided to just use a 15-matches model with exponential weighting. Combining this with the home/away feature calculation, I am this time comparing just two different models.

## Model features

Based on the available Understat match statistics and my experience with the classical Poisson models, I am using the different strength features (e.g. home goals for strength). Such a strength feature basically indicates the ratio of a team to the league average. A team with a home goals for strength of 1.2 on average scores 20% more goals than the league average. Going this way has two main advantages: the scale of the features is more or less the same, with 1 as the center of each feature, and you might not need to care about normalization or standardization in the first step.
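As a small illustration of the strength ratio (all numbers made up):

```python
# Strength feature sketch: ratio of a team's average to the league average
# (values are made up for illustration)

def strength(team_avg: float, league_avg: float) -> float:
    return team_avg / league_avg

# a team scoring 1.8 goals per home match in a league averaging 1.5
home_goals_for_strength = strength(1.8, 1.5)
print(round(home_goals_for_strength, 2))  # 1.2 -> 20% above league average
```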

So we need to calculate the following features for each match:

### Feature Calculation

The strength features are calculated based on the league average and the team averages for the respective feature:

TeamStrength = TeamAverage / LeagueAverage

For the league average I decided to use a 500-matches simple moving average (SMA500). We want a relatively stable long-term average for a league, and 500 matches is more than all matches of a single season. As already described in the model list, the team average is calculated with an exponential moving average over the last 15 matches. Following this, we need e.g. the following calculation for the HomeGoalsForStrength feature:

HomeGoalsForStrengthEMA15 = TeamAverageHomeGoalsEMA15 / LeagueAverageHomeGoalsSMA500

If you did not follow one of my older posts, the base for my samples is my data model containing data from football-data.co.uk and understat. All calculations inside my database are done in SQL, but they can of course also be transferred into other languages. All used sources can be found in my GitHub repository. If you are not interested in the implementation of the models, you can jump directly to the model simulation part.

For all features I needed a combination of football-data.co.uk data and understat data, as the understat data was missing the shot information. For the calculation of the SMA500 stats, I used an analytical window function: the average is calculated per division, and the window consists of the latest 500 matches ordered by the match date.

```sql
create or replace table betting_dv.football_match_his_l_s_understat_league_stats_sma as
/*
* 500 matches are taken to calculate the avg goals for a division
*/
select
his.football_match_his_lid,
-----
--normal goals
--SMAs for home goals for
round(avg(stats.home_goals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_goals_for_sma500,
--SMAs for home goals against
round(avg(stats.away_goals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_goals_against_sma500,
--SMAs for away goals for
round(avg(stats.away_goals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_goals_for_sma500,
--SMAs for away goals against
round(avg(stats.home_goals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_goals_against_sma500,
------
--expected goals
--SMAs for home xgoals for
round(avg(stats.home_xgoals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_xg_for_sma500,
--SMAs for home xgoals against
round(avg(stats.away_xgoals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_xg_against_sma500,
--SMAs for away xgoals for
round(avg(stats.away_xgoals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_xg_for_sma500,
--SMAs for away xgoals against
round(avg(stats.home_xgoals) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_xg_against_sma500,
--shots
--SMAs for home shots for
round(avg(stats2.home_shots) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_shots_for_sma500,
--SMAs for home shots against
round(avg(stats2.away_shots) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_shots_against_sma500,
--SMAs for away shots for
round(avg(stats2.away_shots) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_shots_for_sma500,
--SMAs for away shots against
round(avg(stats2.home_shots) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_shots_against_sma500,
--shots_on_target
--SMAs for home shots_on_target for
round(avg(stats2.home_shots_target) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_shots_on_target_for_sma500,
--SMAs for home shots_on_target against
round(avg(stats2.away_shots_target) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_shots_on_target_against_sma500,
--SMAs for away shots_on_target for
round(avg(stats2.away_shots_target) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_shots_on_target_for_sma500,
--SMAs for away shots_on_target against
round(avg(stats2.home_shots_target) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_shots_on_target_against_sma500,
--ppda
--SMAs for home ppda for
round(avg(stats.home_ppda) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_ppda_for_sma500,
--SMAs for home ppda against
round(avg(stats.away_ppda) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_ppda_against_sma500,
--SMAs for away ppda for
round(avg(stats.away_ppda) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_ppda_for_sma500,
--SMAs for away ppda against
round(avg(stats.home_ppda) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_ppda_against_sma500,
--deep
--SMAs for home deep for
round(avg(stats.home_deep) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_deep_for_sma500,
--SMAs for home deep against
round(avg(stats.away_deep) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) home_deep_against_sma500,
--SMAs for away deep for
round(avg(stats.away_deep) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_deep_for_sma500,
--SMAs for away deep against
round(avg(stats.home_deep) over (partition by division order by match_date rows between 500 preceding and current row exclude current row),1) away_deep_against_sma500
from
betting_dv.football_match_his_b his
join raw_dv.football_match_his_l_s_understat_team_stats stats on
his.football_match_his_lid = stats.football_match_his_lid
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_STATISTIC stats2
ON his.FOOTBALL_MATCH_HIS_LID = stats2.FOOTBALL_MATCH_HIS_LID
;
```


In the second step the team averages need to be calculated. Here we have to distinguish between the two model types. Calculating the EMA15 value for the home/away model is easier: the historic stats just need to be determined separately for a team playing at home and away. As the stats contain the information for the current match, the stats need to be shifted by one match. This can be done via the analytical function LEAD, which determines the previous home or away match ID for the team.

```sql
with
home_his as
(
/**
get historic matches for home team
statistics are projected for 1 match in the past,
as the result of the current match should not be part of
the prediction features
**/
select
his.football_match_his_lid,
his.home_team,
his.match_date,
--home stats
stats.home_goals 				home_goals_for,
stats.away_goals				home_goals_against,
stats.home_xgoals 			home_xg_for,
stats.away_xgoals 			home_xg_against,
stats2.home_shots				home_shots_for,
stats2.AWAY_SHOTS				home_shots_against,
stats2.HOME_SHOTS_TARGET 	home_shots_on_target_for,
stats2.AWAY_SHOTS_TARGET 	home_shots_on_target_against,
stats.HOME_PPDA 				home_ppda_for,
stats.AWAY_PPDA 				home_ppda_against,
stats.HOME_DEEP 				home_deep_for,
stats.AWAY_DEEP 				home_deep_against
from
(
select
his.football_match_his_lid,
his.match_date,
his.home_team,
his.away_team,
lead(football_match_his_lid) over (partition by his.home_team order by his.match_date desc) prev_football_match_his_lid
from
betting_dv.football_match_his_b his
) his
join raw_dv.football_match_his_l_s_understat_team_stats stats
on his.prev_football_match_his_lid = stats.football_match_his_lid
--shot stats also have to be shifted by one match
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_STATISTIC stats2
ON his.prev_football_match_his_lid = stats2.FOOTBALL_MATCH_HIS_LID
),
away_his as
(
/**
get historic matches for away team
statistics are projected for 1 match in the past,
as the result of the current match should not be part of
the prediction features
**/
select
his.football_match_his_lid,
his.away_team,
his.match_date,
--away stats
stats.away_goals 				away_goals_for,
stats.home_goals				away_goals_against,
stats.away_xgoals 			away_xg_for,
stats.home_xgoals 			away_xg_against,
stats2.away_shots				away_shots_for,
stats2.HOME_SHOTS				away_shots_against,
stats2.AWAY_SHOTS_TARGET 	away_shots_on_target_for,
stats2.HOME_SHOTS_TARGET 	away_shots_on_target_against,
stats.away_PPDA 				away_ppda_for,
stats.home_PPDA 				away_ppda_against,
stats.away_DEEP 				away_deep_for,
stats.home_DEEP 				away_deep_against
from
(
select
his.football_match_his_lid,
his.match_date,
his.home_team,
his.away_team,
lead(football_match_his_lid) over (partition by his.away_team order by his.match_date desc) prev_football_match_his_lid
from
betting_dv.football_match_his_b his
) his
join raw_dv.football_match_his_l_s_understat_team_stats stats
on his.prev_football_match_his_lid = stats.football_match_his_lid
--shot stats also have to be shifted by one match
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_STATISTIC stats2
ON his.prev_football_match_his_lid = stats2.FOOTBALL_MATCH_HIS_LID
)
```


For each single statistic the exponential moving average based on the last 15 matches (EMA15) has to be calculated. I am using my good old EMA user-defined function. This has to be done for all 24 features. Afterwards, the results of the EMA calculations can be put together for each match based on the match date and the home or away team.

```sql
home_goals_for_ema15 as
(
select
betting_dv.ema(home_team, match_date, home_goals_for, 15)
from
home_his
group by
home_team
)
```
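Outside the database the same EMA15 can be approximated with pandas' `ewm`; the exact weighting of my UDF may differ slightly, and the data and column names here are made up:

```python
import pandas as pd

# Sketch of an EMA15 per team with pandas (weighting of the database UDF
# may differ; data and column names are hypothetical)
home_his = pd.DataFrame({
    'home_team':      ['A', 'A', 'A', 'B', 'B'],
    'match_date':     pd.to_datetime(['2021-08-01', '2021-08-15', '2021-08-29',
                                      '2021-08-07', '2021-08-21']),
    'home_goals_for': [2, 1, 3, 0, 1],
})

home_his = home_his.sort_values(['home_team', 'match_date'])
home_his['home_goals_for_ema15'] = (
    home_his.groupby('home_team')['home_goals_for']
            .transform(lambda s: s.ewm(span=15).mean())
)
print(home_his[['home_team', 'home_goals_for_ema15']])
```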


Doing all this for the combined model looks a bit different and a bit more complicated. The starting point needs to be a list of all teams, which is joined with the historic matches, as you need the stats for each team playing home and away. Whether a team played at home or away also needs to be considered when selecting the correct historic stats for the team averages. One example: for the team average stat GOALS_FOR, the home goals have to be used if a team played at home (HOME_AWAY = 'h') and the away goals if the team played away. The LEAD() function is again used, in this case to determine the previous match date and move all stats by one match. In contrast to before, the team from the team list is used as the partition and not the home or away team, as at this point it's not known whether the previous match was a home or away match.

```sql
WITH mvd_stats
AS
(
/**
get historic matches for teams
statistics are projected for 1 match in the past,
as the result of the current match should not be part of
the prediction features,
no differentiation between home and away
**/
SELECT
his.football_match_his_lid,
his.MATCH_DATE,
--get next match, to exclude stats for current one
--stats are moved by one match
lag(his.football_match_his_lid) over (partition by t.FOOTBALL_TEAM_HID order by his.match_date desc) next_football_match_his_lid,
lag(his.MATCH_DATE) OVER (PARTITION BY t.FOOTBALL_TEAM_HID ORDER BY his.MATCH_DATE desc) next_match_date,
lead(his.football_match_his_lid) over (partition by t.FOOTBALL_TEAM_HID order by his.match_date desc) prev_football_match_his_lid,
lead(his.MATCH_DATE) OVER (PARTITION BY t.FOOTBALL_TEAM_HID ORDER BY his.MATCH_DATE desc) prev_match_date,
--
his.FOOTBALL_TEAM_HOME_HID,
his.HOME_TEAM,
his.FOOTBALL_TEAM_AWAY_HID,
his.AWAY_TEAM,
--
t.FOOTBALL_TEAM_HID,
t.TEAM ,
CASE
WHEN t.FOOTBALL_TEAM_HID = his.FOOTBALL_TEAM_HOME_HID THEN 'h'
WHEN t.FOOTBALL_TEAM_HID = his.FOOTBALL_TEAM_AWAY_HID THEN 'a'
ELSE 'n.a.'
END home_away,
CASE WHEN LOCAL.home_away = 'h' THEN s1.home_goals else s1.away_goals END						goals_for,
CASE WHEN LOCAL.home_away = 'h' THEN s1.away_goals else s1.home_goals END						goals_against,
CASE WHEN LOCAL.home_away = 'h' THEN s1.home_xgoals else s1.away_xgoals END						xg_for,
CASE WHEN LOCAL.home_away = 'h' THEN s1.away_xgoals else s1.home_xgoals END						xg_against,
CASE WHEN LOCAL.home_away = 'h' THEN s1.home_npxg else s1.away_npxg END							npxg_for,
CASE WHEN LOCAL.home_away = 'h' THEN s1.away_npxg else s1.home_npxg END							npxg_against,
CASE WHEN LOCAL.home_away = 'h' THEN s2.home_shots else s2.away_shots END						shots_for,
CASE WHEN LOCAL.home_away = 'h' THEN s2.away_shots else s2.home_shots END						shots_against,
CASE WHEN LOCAL.home_away = 'h' THEN s2.home_shots_target else s2.away_shots_target END			shots_target_for,
CASE WHEN LOCAL.home_away = 'h' THEN s2.away_shots_target else s2.home_shots_target END			shots_target_against,
CASE WHEN LOCAL.home_away = 'h' THEN s1.home_deep else s1.away_deep END							deep_for,
CASE WHEN LOCAL.home_away = 'h' THEN s1.away_deep else s1.home_deep END							deep_against,
CASE WHEN LOCAL.home_away = 'h' THEN s1.home_ppda else s1.away_ppda END							ppda_for,
CASE WHEN LOCAL.home_away = 'h' THEN s1.away_ppda else s1.home_ppda END							ppda_against
from
betting_dv.football_match_his_b his
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STATS s1
ON his.FOOTBALL_MATCH_HIS_LID = s1.FOOTBALL_MATCH_HIS_LID
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_STATISTIC s2
ON his.FOOTBALL_MATCH_HIS_LID = s2.FOOTBALL_MATCH_HIS_LID
--join team for away and home to get
--complete historic matches
JOIN RAW_DV.FOOTBALL_TEAM_H t
ON his.FOOTBALL_TEAM_HOME_HID = t.FOOTBALL_TEAM_HID
OR his.FOOTBALL_TEAM_AWAY_HID = t.FOOTBALL_TEAM_HID
)
```


Calculating the EMA15 follows the same approach, and the results again just need to be joined with the historic matches. Whoever wants to take a deeper look at the complete SQL samples will find everything published in my GitHub repo.

### Correlation test

We now have a total of 24 features. It's interesting to know whether each feature really affects the output classes (home win, draw, away win). For this we can use a correlation test(1). This simple test tells us whether there is a linear relation between a feature and the target class.

```python
import pyexasol
import pandas as pd
import matplotlib.pyplot as plt

dwh_host = '[host-name]'
dwh_port = '8563'
dwh_user = '[DB-user]'
dwh_pass = '[DB-password]'

v_sql = """SELECT
round(f.home_goals_for_STRENGTH,1) feature,
avg(CASE WHEN s.HOME_GOALS > s.AWAY_GOALS THEN 1 ELSE 0 END) home_win_prob
FROM
BETTING_DV.FOOTBALL_MATCH_HIS_B his
JOIN BETTING_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STRENGTH_FEATURES_EMA15_ha f
ON his.FOOTBALL_MATCH_HIS_LID = f.FOOTBALL_MATCH_HIS_LID
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STATS s
ON his.FOOTBALL_MATCH_HIS_LID = s.FOOTBALL_MATCH_HIS_LID
GROUP BY
1
having
count(*) > 20
ORDER BY 1;"""

#connect to db
dwh_conn = pyexasol.connect(dsn=dwh_host + ':' + dwh_port, user=dwh_user, password=dwh_pass)

pd_cor = dwh_conn.export_to_pandas(v_sql)

dwh_conn.close()

x = pd_cor.iloc[:,0]
y = pd_cor.iloc[:,1]

plt.scatter(x, y)
plt.xlabel("Away PPDA for strength")
plt.ylabel("Home win percentage")
plt.show()

pd_cor.corr(method ='pearson')
```


The feature used for the correlation test and the output class are directly selected in the SQL statement. The result of the SQL statement is transferred directly into a pandas dataframe via the pyexasol package. The corr() method of pandas then calculates the correlation between the single feature and the output class.

The selected examples all show a strong correlation with the target variable. The higher the home xG for strength of a team, the more goals the team scores at home in comparison to an average team, and the higher the probability of winning a match. A different case is e.g. the away PPDA for strength. This feature also has a strong linear relation with the output class, but it's inverse: the lower the PPDA value of a team, the better they press the opponent. That's why a lower PPDA strength value indicates a higher probability of winning a match.

Comparing the feature correlations for both model types does not show any huge difference. All features strongly correlate with the target variable. The combined features partly have a slightly higher correlation. Will this influence the model performance? We will see.

### Feature importance

Another interesting check I always do before testing a new model is the feature importance(2)(3). It gives you some information about how important each feature is for the prediction.

```python
import pyexasol
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

dwh_host = '[host-name]'
dwh_port = '8563'
dwh_user = '[DB-user]'
dwh_pass = '[DB-password]'

v_sql = """-- BETTING_MART.UNDERSTAT_TEAM_MODEL_FEATURES_EMA15 source
SELECT
--model features
f.HOME_GOALS_FOR_STRENGTH,
f.HOME_GOALS_AGAINST_STRENGTH,
f.HOME_XG_FOR_STRENGTH,
f.HOME_XG_AGAINST_STRENGTH,
f.HOME_SHOTS_FOR_STRENGTH,
f.HOME_SHOTS_AGAINST_STRENGTH,
f.HOME_SHOTS_ON_TARGET_FOR_STRENGTH,
f.HOME_SHOTS_ON_TARGET_AGAINST_STRENGTH,
f.HOME_PPDA_FOR_STRENGTH,
f.HOME_PPDA_AGAINST_STRENGTH,
f.HOME_DEEP_FOR_STRENGTH,
f.HOME_DEEP_AGAINST_STRENGTH,
f.AWAY_GOALS_FOR_STRENGTH,
f.AWAY_GOALS_AGAINST_STRENGTH,
f.AWAY_XG_FOR_STRENGTH,
f.AWAY_XG_AGAINST_STRENGTH,
f.AWAY_SHOTS_FOR_STRENGTH,
f.AWAY_SHOTS_AGAINST_STRENGTH,
f.AWAY_SHOTS_ON_TARGET_FOR_STRENGTH,
f.AWAY_SHOTS_ON_TARGET_AGAINST_STRENGTH,
f.AWAY_PPDA_FOR_STRENGTH,
f.AWAY_PPDA_AGAINST_STRENGTH,
f.AWAY_DEEP_FOR_STRENGTH,
f.AWAY_DEEP_AGAINST_STRENGTH,
--target variables
CASE WHEN s.HOME_GOALS > s.AWAY_GOALS THEN 1 ELSE 0 END home_win,
CASE WHEN s.HOME_GOALS = s.AWAY_GOALS THEN 1 ELSE 0 END draw,
CASE WHEN s.HOME_GOALS < s.AWAY_GOALS THEN 1 ELSE 0 END away_win
FROM
BETTING_DV.FOOTBALL_MATCH_HIS_B his
JOIN BETTING_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STRENGTH_FEATURES_EMA15_ha f
ON his.FOOTBALL_MATCH_HIS_LID = f.FOOTBALL_MATCH_HIS_LID
JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STATS s
ON his.FOOTBALL_MATCH_HIS_LID = s.FOOTBALL_MATCH_HIS_LID
WHERE
--features have to exist
f.HOME_GOALS_FOR_STRENGTH IS NOT NULL AND
f.HOME_SHOTS_FOR_STRENGTH IS NOT NULL and
f.AWAY_GOALS_FOR_STRENGTH IS not NULL and
f.AWAY_SHOTS_FOR_STRENGTH IS not NULL and
--first season is skipped to have stable features
his.SEASON <> '2014_2015';"""

#connect to db
dwh_conn = pyexasol.connect(dsn=dwh_host + ':' + dwh_port, user=dwh_user, password=dwh_pass)

pd_ft_imp = dwh_conn.export_to_pandas(v_sql)

dwh_conn.close()

pd_feature = pd_ft_imp.iloc[:,0:24]

pd_class = pd_ft_imp.iloc[:,24]

model=RandomForestClassifier()
model.fit(pd_feature,pd_class)
feature_importances=pd.DataFrame({'features':pd_feature.columns,'feature_importance':model.feature_importances_})
feature_importances.sort_values('feature_importance',ascending=False)
```


To determine the feature importance, all features and target variables have to be loaded into a data frame and a single target variable needs to be selected. Many models provide the possibility to output the feature importance after training. I always like to use a simple random forest classifier.

The analysis shows that nearly all features have a similar importance. An evenly distributed importance across our 24 features would be about 0.041 (1/24). The most important feature for the home win probability is the home xG for strength with 0.051. The most irrelevant feature, home shots against strength, has a value of 0.036. This distribution shows that there's not really a feature we could drop.

## Model training

### ML algorithms

Predicting the outcome of a football match is a typical multi-class classification problem(4). We have three different outcome classes: home win, draw, away win. I decided to compare three different machine learning algorithms:

- Logistic regression
- Random forest
- XGBoost

In the following code examples I will just show how I used the XGBoost algorithm. The logistic regression and the random forest can be handled the same way.

```python
from xgboost import XGBClassifier

# define models and parameters
model = XGBClassifier(objective ='multi:softprob')
```
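For completeness, the two other candidates can be instantiated analogously; a sketch using scikit-learn defaults, not the tuned hyperparameters:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# the logistic regression and random forest counterparts
# (scikit-learn defaults here, not the tuned values)
model_lr = LogisticRegression(max_iter=1000)
model_rf = RandomForestClassifier()
```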


### Getting the data

But before we can start training the model, we need to get the training data. This part was already described in the feature importance section. There's just one adaptation needed: for the chosen models and the cross-validation method, the prediction class must not be shaped as a one-hot-encoded vector. A single column containing the different results is needed.

```sql
case
when home_win = 1 then 'H'
when draw = 1 then 'D'
else 'A'
end result
```


The resulting data frame needs to be split into model features and result classes.

```python
x = df_data.iloc[:,:24]
y = df_data.iloc[:,27:28]
```


### Hyperparameter tuning

Each ML algorithm provides different hyperparameters(6). They are used to configure the model and improve the overall model performance. Every model has different hyperparameters and a different number of them. Simpler models like the logistic regression have a small number of parameters, so the tuning is easier. For models like XGBoost this can get a bit time-consuming. At the beginning I always create a list of all parameters which I want to optimize. Most of the time I just optimize two parameters at once. All parameters are stored in a dictionary object.

```python
#
# define hyperparameters
learning_rate = [0.05, 0.07, 0.1]
n_estimators = [40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250]

max_depth = [2]
min_child_weight = [10]
subsample = [0.9]
colsample_bytree = [0.5]
gamma = [0.2]

#
# define grid search
grid = dict(learning_rate=learning_rate,
max_depth=max_depth,
min_child_weight=min_child_weight,
gamma=gamma,
subsample=subsample,
colsample_bytree=colsample_bytree,
n_estimators = n_estimators
)
```


Training just one model for each parameter combination might provide a wrong picture of the model performance. That's why cross-validation is a standard method to estimate the model performance. During the cross-validation the model is trained and compared multiple times with different splits of the dataset. The combination of cross-validation and grid search provides the information which hyperparameters are needed for the best model performance.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=20)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
grid_result = grid_search.fit(x, y)
```


For visualising the results of a hyperparameter search, I found a small bit of code at Stack Overflow which helped me to better understand the results.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2):
    # get test scores mean and std for each grid search
    scores_mean = cv_results['mean_test_score']
    scores_mean = np.array(scores_mean).reshape(len(grid_param_2), len(grid_param_1))

    scores_sd = cv_results['std_test_score']
    scores_sd = np.array(scores_sd).reshape(len(grid_param_2), len(grid_param_1))

    # plot grid search scores
    _, ax = plt.subplots(1, 1)

    # param 1 is the x-axis, param 2 is represented as a different curve (color line)
    for idx, val in enumerate(grid_param_2):
        ax.plot(grid_param_1, scores_mean[idx, :], '-o', label=name_param_2 + ': ' + str(val))

    ax.set_title("Grid Search Scores", fontsize=20, fontweight='bold')
    ax.set_xlabel(name_param_1, fontsize=16)
    ax.set_ylabel('CV Average Score', fontsize=16)
    ax.legend(loc="best", fontsize=15)
    ax.grid('on')

plot_grid_search(grid_result.cv_results_, n_estimators, learning_rate, 'n_estimators', 'learning_rate')
```


Based on the graph I was able to identify whether a high accuracy was really some kind of maximum or just a fluctuating accuracy. In this example the grid search clearly indicates that the parameter n_estimators with values around 110 provides the best model performance when using a learning rate of 0.07.

## Model results

Running the hyperparameter tuning for all selected ML algorithms provided the following results.

Two important facts can already be identified. All models provide a similar performance; there's not really a big difference between them. Differentiating between home and away performance does not really provide an advantage in comparison to the combined models. The second fact is a bit surprising to me: I expected the XGBoost algorithm, as the most advanced one, to also provide the best model results. Instead the simple logistic regression model has the highest accuracy. Will this small advantage also show a difference during the simulation? We will see.

## Model simulation

The hyperparameter optimization already provides an indication of which model has the highest accuracy. But to get the real performance, the different models need to be simulated over time. To have some kind of real-world simulation, I decided not to randomly split training and test sets, but instead to use time-dependent iterations. The season 2015/16 is used as the initial training set for the first iteration. The trained models then predict the results for all matches in August 2016, the first month of the season 2016/17. In the next iteration August 2016 becomes part of the training data set, and the next month is predicted. So the training set becomes larger and larger. It will be interesting to see whether the size of the training set also has some impact on the model performance in the first iterations.
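Conceptually this is an expanding-window validation. A minimal sketch of the iteration logic with made-up data and column names:

```python
import pandas as pd

# Expanding-window sketch: each month is predicted with a model trained on
# all earlier months (dataframe and column names are hypothetical)
matches = pd.DataFrame({
    'yyyymm': [201508, 201509, 201608, 201609],
    'result': ['H', 'A', 'D', 'H'],
})

first_test_month = 201608
splits = []
for month in sorted(int(m) for m in matches.loc[matches['yyyymm'] >= first_test_month, 'yyyymm'].unique()):
    train = matches[matches['yyyymm'] < month]  # all matches before the month
    test = matches[matches['yyyymm'] == month]  # the month to predict
    splits.append((month, len(train), len(test)))

print(splits)  # [(201608, 2, 1), (201609, 3, 1)]
```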

All iteration steps are created based on the available match dates in the historic data. The season 2014/15 is again skipped as it's the first season with data, and season 2015/16 represents the first training set, as already mentioned.

```python
#get iterator month data
def get_iterator_data():

    #get iterator data
    v_sql = """SELECT distinct
    to_char(his.MATCH_DATE,'yyyymm') iterator
    FROM
    BETTING_DV.FOOTBALL_MATCH_HIS_B his
    JOIN BETTING_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STRENGTH_FEATURES_EMA15_ha f
    ON his.FOOTBALL_MATCH_HIS_LID = f.FOOTBALL_MATCH_HIS_LID
    WHERE
    --2014_15 season dropped for stable features
    season <> '2014_2015'
    --2015_16 is first training season
    AND season <> '2015_2016'
    ORDER BY 1;"""

    return dwh_conn.export_to_pandas(v_sql)
```


The data is written directly into a pandas data frame, so that it's easy to loop over the whole iterator set. I am looping over the set two times: once for all combined models and once for all home/away models. This could also be done in a single loop. The next code samples will only show the training and prediction of the combined logistic regression model. All other models follow the same logic.

```python
#get iterator data
df_iterator = get_iterator_data()

#
# combined models
#

#loop over month
for index, row in df_iterator.iterrows():
    print('Prediction for: ' + str(row['ITERATOR']))
```


The iterator is used as a filter for the training data set. All matches before the provided month filter are used for the training. As for the hyperparameter optimization, the training set needs to consist of all model features and the result classes.

```python
def get_training_data_comb(month_filter):

    #load training data based on provided filter
    v_sql = """-- BETTING_MART.UNDERSTAT_TEAM_MODEL_FEATURES_EMA15 source
    SELECT
    --model features
    f.HOME_GOALS_FOR_STRENGTH,
    f.HOME_GOALS_AGAINST_STRENGTH,
    f.HOME_XG_FOR_STRENGTH,
    f.HOME_XG_AGAINST_STRENGTH,
    f.HOME_SHOTS_FOR_STRENGTH,
    f.HOME_SHOTS_AGAINST_STRENGTH,
    f.HOME_SHOTS_ON_TARGET_FOR_STRENGTH,
    f.HOME_SHOTS_ON_TARGET_AGAINST_STRENGTH,
    f.HOME_PPDA_FOR_STRENGTH,
    f.HOME_PPDA_AGAINST_STRENGTH,
    f.HOME_DEEP_FOR_STRENGTH,
    f.HOME_DEEP_AGAINST_STRENGTH,
    f.AWAY_GOALS_FOR_STRENGTH,
    f.AWAY_GOALS_AGAINST_STRENGTH,
    f.AWAY_XG_FOR_STRENGTH,
    f.AWAY_XG_AGAINST_STRENGTH,
    f.AWAY_SHOTS_FOR_STRENGTH,
    f.AWAY_SHOTS_AGAINST_STRENGTH,
    f.AWAY_SHOTS_ON_TARGET_FOR_STRENGTH,
    f.AWAY_SHOTS_ON_TARGET_AGAINST_STRENGTH,
    f.AWAY_PPDA_FOR_STRENGTH,
    f.AWAY_PPDA_AGAINST_STRENGTH,
    f.AWAY_DEEP_FOR_STRENGTH,
    f.AWAY_DEEP_AGAINST_STRENGTH,
    --prediction class
    case
    when s.HOME_GOALS > s.AWAY_GOALS then 'H'
    when s.HOME_GOALS = s.AWAY_GOALS then 'D'
    else 'A'
    end match_result
    FROM
    BETTING_DV.FOOTBALL_MATCH_HIS_B his
    JOIN BETTING_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STRENGTH_FEATURES_EMA15 f
    ON his.FOOTBALL_MATCH_HIS_LID = f.FOOTBALL_MATCH_HIS_LID
    JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STATS s
    ON his.FOOTBALL_MATCH_HIS_LID = s.FOOTBALL_MATCH_HIS_LID
    WHERE
    --features have to exist
    f.HOME_GOALS_FOR_STRENGTH IS NOT NULL AND
    f.HOME_SHOTS_FOR_STRENGTH IS NOT NULL AND
    f.AWAY_GOALS_FOR_STRENGTH IS NOT NULL AND
    f.AWAY_SHOTS_FOR_STRENGTH IS NOT NULL AND
    --first season is skipped to have stable features
    his.SEASON <> '2014_2015' AND
    --filter for iterator
    to_number(to_char(his.MATCH_DATE,'yyyymm')) < """ + str(month_filter) + ";"

    return dwh_conn.export_to_pandas(v_sql)
```


The training data frame is split into features and result classes. The values determined during the hyperparameter optimization are used to train the model.

```python
df_train_data = get_training_data_comb(row['ITERATOR'])

#split features and pred classes
df_train_x = df_train_data.iloc[:, :24]
#ravel() turns the single result column into the 1-D array scikit-learn expects
df_train_y = df_train_data.iloc[:, 24].values.ravel()

#
# train logistic regression model
#
print('...training logistic regression model')

solvers = 'liblinear'
penalty = 'l2'
c_values = 1.0

lr_model = LogisticRegression(solver=solvers, penalty=penalty, C=c_values, random_state=0).fit(df_train_x, df_train_y)
```


Beside the model features, the prediction data set contains the specific ID of the football match, so we keep the information which football match a prediction belongs to. The data set contains all matches of the current iterator month.

```python
def get_pred_data_comb(month_filter):

    #load prediction data based on provided filter
    v_sql = """-- BETTING_MART.UNDERSTAT_TEAM_MODEL_FEATURES_EMA15 source
    SELECT
    f.football_match_his_lid,
    --model features
    f.HOME_GOALS_FOR_STRENGTH,
    f.HOME_GOALS_AGAINST_STRENGTH,
    f.HOME_XG_FOR_STRENGTH,
    f.HOME_XG_AGAINST_STRENGTH,
    f.HOME_SHOTS_FOR_STRENGTH,
    f.HOME_SHOTS_AGAINST_STRENGTH,
    f.HOME_SHOTS_ON_TARGET_FOR_STRENGTH,
    f.HOME_SHOTS_ON_TARGET_AGAINST_STRENGTH,
    f.HOME_PPDA_FOR_STRENGTH,
    f.HOME_PPDA_AGAINST_STRENGTH,
    f.HOME_DEEP_FOR_STRENGTH,
    f.HOME_DEEP_AGAINST_STRENGTH,
    f.AWAY_GOALS_FOR_STRENGTH,
    f.AWAY_GOALS_AGAINST_STRENGTH,
    f.AWAY_XG_FOR_STRENGTH,
    f.AWAY_XG_AGAINST_STRENGTH,
    f.AWAY_SHOTS_FOR_STRENGTH,
    f.AWAY_SHOTS_AGAINST_STRENGTH,
    f.AWAY_SHOTS_ON_TARGET_FOR_STRENGTH,
    f.AWAY_SHOTS_ON_TARGET_AGAINST_STRENGTH,
    f.AWAY_PPDA_FOR_STRENGTH,
    f.AWAY_PPDA_AGAINST_STRENGTH,
    f.AWAY_DEEP_FOR_STRENGTH,
    f.AWAY_DEEP_AGAINST_STRENGTH
    FROM
    BETTING_DV.FOOTBALL_MATCH_HIS_B his
    JOIN BETTING_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STRENGTH_FEATURES_EMA15 f
    ON his.FOOTBALL_MATCH_HIS_LID = f.FOOTBALL_MATCH_HIS_LID
    JOIN RAW_DV.FOOTBALL_MATCH_HIS_L_S_UNDERSTAT_TEAM_STATS s
    ON his.FOOTBALL_MATCH_HIS_LID = s.FOOTBALL_MATCH_HIS_LID
    WHERE
    --features have to exist
    f.HOME_GOALS_FOR_STRENGTH IS NOT NULL AND
    f.HOME_SHOTS_FOR_STRENGTH IS NOT NULL AND
    f.AWAY_GOALS_FOR_STRENGTH IS NOT NULL AND
    f.AWAY_SHOTS_FOR_STRENGTH IS NOT NULL AND
    --first season is skipped to have stable features
    his.SEASON <> '2014_2015' AND
    --filter for iterator
    to_number(to_char(his.MATCH_DATE,'yyyymm')) = """ + str(month_filter) + ";"

    return dwh_conn.export_to_pandas(v_sql)
```


For the prediction, the features need to be extracted from the prediction data frame. The predict_proba() function of the model is used for the prediction. This function provides the probability for each single class; otherwise we would just get the class with the highest probability. Afterwards I create a data frame which only contains the match ID, the predicted result probabilities, the model name, and the current iterator. This set of predictions is written into a database table. After all models for all iterations have been processed, this table contains the complete predictions, which can be used to analyse the performance of the single models in comparison to the betting market.

```python
#
# predicting logistic regression
#

#get data for prediction
df_pred_data = get_pred_data_comb(row['ITERATOR'])

#do prediction
df_pred_x = df_pred_data.iloc[:, 1:25]

print('...executing logistic regression prediction')

df_pred_y = lr_model.predict_proba(df_pred_x)

#create result data frame
#predict_proba columns follow lr_model.classes_, sorted alphabetically: 'A', 'D', 'H'
df_result_data = pd.DataFrame(df_pred_data.iloc[:, 0:1])
df_result_data['model'] = 'Logistic Regression EMA15 combined'
df_result_data['home_win_prob'] = df_pred_y[:, 2]
df_result_data['draw_prob'] = df_pred_y[:, 1]
df_result_data['away_win_prob'] = df_pred_y[:, 0]
df_result_data['interator_param'] = str(row['ITERATOR'])

print('......data written to the db')
dwh_conn.import_from_pandas(df_result_data, ('BETTING_DV', 'football_match_his_l_understat_pred_ema15'))
```
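The mapping of the predict_proba() columns to home/draw/away is easy to verify: scikit-learn sorts the class labels alphabetically, so for 'H', 'D' and 'A' the away win comes first. A minimal, self-contained sketch (toy data only, not the original pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: 4 random features, balanced result labels
X = np.random.RandomState(0).rand(30, 4)
y = np.array(['H', 'D', 'A'] * 10)

model = LogisticRegression(solver='liblinear').fit(X, y)

# classes_ defines the column order of predict_proba()
print(model.classes_)  # -> ['A' 'D' 'H']: away, draw, home
```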


## Model performance

Measuring the Brier score of a new model is always one of my first tasks when it comes to determining the model performance. This makes it possible to rank all my different models. As a benchmark for the blog, we can compare them to the best vanilla xG Poisson model from my last model set. Those models also used xG data, and the best one had a Brier score of 0.586.

$BS = \frac{1}{N} \sum\limits_{t=1}^{N} (f_t - o_t)^2$
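For the three-way match result, the score is computed per match over all three outcome probabilities (the common multi-class extension, where the worst possible score is 2.0). A small sketch with made-up numbers:

```python
import numpy as np

def multiclass_brier(probs, outcomes):
    """Mean over all matches of the squared distance between the
    predicted probabilities and the one-hot encoded actual result."""
    return np.mean(np.sum((probs - outcomes) ** 2, axis=1))

# two hypothetical matches, probabilities ordered (home, draw, away)
probs = np.array([[0.50, 0.30, 0.20],
                  [0.25, 0.30, 0.45]])
# actual results: first match home win, second match away win
outcomes = np.array([[1, 0, 0],
                     [0, 0, 1]])

print(multiclass_brier(probs, outcomes))  # -> 0.4175
```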

The model performance was determined based on 8921 matches from the Big 5 leagues. The number of matches per division varies a bit, as e.g. the Bundesliga consists of only 18 teams and the Ligue 1 season was cut short during the COVID-19 pandemic.

The accuracy during the hyperparameter optimization indicated the logistic regression as the best machine learning algorithm. That is also reflected in the calculated Brier score of the model. But surprisingly, the XGBoost algorithm has the highest, i.e. worst, Brier score. Compared to the market odds, every model is worse, so you should not expect a positive overall profit in the betting simulation. Compared to the vanilla Poisson models, 4 models show a similar or better Brier score performance. Does the betting simulation show a similar picture for the profit?

## Betting results

The betting simulation was done with a flat stake of 1 unit. For each match and market, the value based on the prediction was calculated. If the predicted probability of e.g. a home win is higher than the market probability, the bet was selected. For each match just one bet is selected, even if multiple markets showed value.
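The value check described above can be sketched as follows. This is a simplified illustration with hypothetical numbers, not the original simulation code: it compares the model probability to the implied market probability (1 / decimal odds) and ignores the bookmaker margin.

```python
def find_value_bets(model_probs, market_odds):
    """Return the markets where the model probability exceeds the
    implied market probability, together with the edge."""
    value = {}
    for market, prob in model_probs.items():
        implied = 1.0 / market_odds[market]  # implied market probability
        if prob > implied:
            value[market] = round(prob - implied, 4)
    return value

# hypothetical match: model probabilities vs. decimal market odds
print(find_value_bets({'H': 0.50, 'D': 0.28, 'A': 0.22},
                      {'H': 2.10, 'D': 3.40, 'A': 3.60}))
# -> {'H': 0.0238}: only the home win shows value
```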

As expected, all models show a negative overall profit. None of the models is useful for blind betting against the bookie. But the ranking differs from the Brier score ranking: the random forest and XGBoost algorithms show a much better profit performance. Both show a loss similar to the best performing xG Poisson models.

Taking a look back at the intro, it's also time to answer the opening question: Is it necessary to differ between home and away performance? Not when using advanced machine learning algorithms. Random forest and XGBoost show the same performance in both cases. Brier score and overall profit indicate that the chosen algorithm has more influence on the model performance.

## Conclusions

Creating and testing these models was a bit sobering. You would imagine that, using such advanced statistics, the model performance would increase and there would be some kind of chance to compete with the market. But that's again not the case. And that should again be a warning for every recreational bettor who thinks some football knowledge is enough to be profitable. Nevertheless, there are some more conclusions to draw:

1. Again, the Brier score of the models is not in line with the overall profit of the simulation. The model with the best Brier score does not provide the best profit, although I would expect it to in a simulation with multiple thousands of bets. Maybe it's time to compare some metrics for the model performance.
2. When using machine learning algorithms, it's definitely not necessary to differ between home and away performance. The results for the random forest and XGBoost models show no difference between the two model types.
3. For the different parameters, an exponential moving average based on the last 15 matches was used. As the vanilla Poisson models already showed, the number of matches used for the parameters can have a big impact on the overall performance. So there may still be some room to improve the models. But of course the intention of this post was to determine whether there's a need to differ between home and away matches.