During my first investigations for predicting football scores I came across the predictive models of Maher [1] and Dixon / Coles [2]. Maher modelled the number of goals a team scores during a match as two independent Poisson distributed variables, for the home team and the away team. He assumed that each team has an attacking strength and a defence strength. Dixon / Coles extended this model by adjusting some disadvantages of the Poisson distribution and by using a time dependent attack and defence strength. Both papers are the base of my first predictive model.
In this Post I want to describe, how the attack and defence strength are calculated and how you add this calculation to the existing Data Vault model. The predictive model itself will be explained in another post.
Attack & defence strength
Following probability distribution function was used by Maher:
The team i has an attack strength and defence strength
. A correction factor for the home advantage is represented by the parameter k. The more goals a team scores in comparison to the league average, the higher is the attack strength. For the defence strength it is inverse. The better a defence is or the fewer goals a team concedes, the smaller is the defence strength.
For my predictive model, I did not use a home field advantage factor. Instead, I calculated the attack and defence strength separate for home and away teams:
The parameter X defines the number of games over which the average should be determined. So the home attack strength is the ratio of the averaged scored goals of team i in the last X games to the average scored goals of the complete league in the last X games. So X specifies the time dependency. In comparison to Dixon / Coles I do not use a weighting method to give younger games a higher weight. This is definitely an optimization opportunity.
SQL implementation
As I now know, what is the definition of my variables, I can start writing a SQL statement, which calculates these variables. In a first step, I will explain, how the attack and defence strength for a home team is queried. After that I show you, how you calculate the variables for each historic match.
For calculating different features I need to know, which team played in which division in a specific season. Because this information is needed often, I decided to implement this relation as an additional link table. The link table is basically just an aggregation of the football match link table.
create or replace view analytical_layer.football_team_division_l as select distinct hash_md5(football_season_hid || '#' || football_division_hid || '#' || football_team_home_hid) football_team_division_lid, football_season_hid, football_division_hid, football_team_home_hid football_team_hid from raw_layer.football_match_his_l union select distinct hash_md5(football_season_hid || '#' || football_division_hid || '#' || football_team_away_hid) football_team_division_lid, football_season_hid, football_division_hid, football_team_away_hid from raw_layer.football_match_his_l
After the link table was created, I can use e.g. following statement to determine the current home attack and home defence strength for each team of the Bundesliga based on the last 30 games:
select home_team, avg_team_goals_for, avg_team_goals_against, avg_league_goals_for, avg_league_goals_against, round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength, round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength from ( select distinct home_team, avg(goals_for) over (partition by home_team) avg_team_goals_for, avg(goals_against) over (partition by home_team) avg_team_goals_against, avg(goals_for) over (partition by 1) avg_league_goals_for, avg(goals_against) over (partition by 1) avg_league_goals_against from ( select his.match_date his_match_date, home_team.team home_team, stat.full_time_home_goals goals_for, stat.full_time_away_goals goals_against, dense_rank() over (partition by home_team.team order by his.match_date desc) rang_team from analytical layer.football_team_division_l team_division join analytical layer.football_match_his_l his on team_division.football_team_hid = his.football_team_home_hid and team_division.football_division_hid = his.football_division_hid join analytical layer.football_division_h division on team_division.football_division_hid = division.football_division_hid join analytical layer.football_team_h home_team on team_division.football_team_hid = home_team.football_team_hid join analytical layer.football_season_h season on team_division.football_season_hid = season.football_season_hid join analytical layer.football_match_his_l_s_statistic stat on his.football_match_his_lid = stat.football_match_his_lid where division.division = 'D1' and season.season = '2016_2017' ) where rang_team <= 30 )
The inner select (lines 19-40) is used to determine the past home game results of all current teams of the Bundesliga. The analytical function DENSE_RANK calculates the rank over the number of home matches, so that I am able to filter the specific number of matches in the past, which I want to use for the average calculation. In this example I filter for the last 30 home games.
The next both parts are easy. Based on the filtered number of home matches for each team, the statement calculates first the average number of scored and conceded goals for the team and the whole league (lines 11-16). After that, the attack and defence strength are calculated (lines 1-8).
Calculating the away attack and defence strength looks the same except, that you have to use the away team to determine all away matches.
Extend Data Vault model
Unfortunately being able to calculate the current attack and defence strength is not enough. As I want to simulate predictive models in the past, I need to know the attack and defence strength at any possible date. Therefore I implemented the feature calculation as a separate satellite table. By doing this, I am able to use this features in different predictive models. I could also create several satellites with different numbers of included games.

The SQL statement of the first part is the base for the new satellite table. To integrate such a statement as a new table, you have to use the hash-keys instead of names for the team, season and division. Following statement creates a satellite table, which uses the last 30 games to calculate the attack and defence strength:
create or replace view betting_dv.football_match_his_l_s_attack_defence_strength as select fixtures.football_match_his_lid, round(home_str.avg_team_goals_for,2) avg_home_team_goals_for, round(home_str.avg_team_goals_against,2) avg_home_team_goals_against, round(away_str.avg_team_goals_for,2) avg_away_team_goals_for, round(away_str.avg_team_goals_against,2) avg_away_team_goals_against, round(home_str.avg_league_goals_for,2) avg_league_home_goals_for, round(home_str.avg_league_goals_against,2) avg_league_home_goals_against, round(away_str.avg_league_goals_for,2) avg_league_away_goals_for, round(away_str.avg_league_goals_against,2) avg_league_away_goals_against, round(home_str.home_attacking_strength,2) home_attacking_strength, round(home_str.home_defence_strength,2) home_defence_strength, round(away_str.away_attacking_strength,2) away_attacking_strength, round(away_str.away_defence_strength,2) away_defence_strength from --match combinations betting_dv.football_match_his_l fixtures --home strength weakness join ( select football_division_hid, football_season_hid, match_date, football_team_home_hid, avg_team_goals_for, avg_team_goals_against, avg_league_goals_for, avg_league_goals_against, round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength, round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength from ( select distinct football_division_hid, football_season_hid, match_date, football_team_home_hid, avg(goals_for) over (partition by football_team_home_hid) avg_team_goals_for, avg(goals_against) over (partition by football_team_home_hid) avg_team_goals_against, avg(goals_for) over (partition by 1) avg_league_goals_for, avg(goals_against) over (partition by 1) avg_league_goals_against from ( select dates.football_division_hid, dates.football_season_hid, dates.match_date, his.football_team_home_hid, stat.full_time_home_goals goals_for, stat.full_time_away_goals goals_against, dense_rank() over (partition by his.football_team_home_hid order by his.match_date desc) rang_team from ( select distinct football_division_hid, football_season_hid, match_date from analytical_layer.football_match_his_l ) dates join analytical layer.football_team_division_l team_division on dates.football_season_hid = team_division.football_season_hid and dates.football_division_hid = team_division.football_division_hid join analytical layer.football_match_his_l his on dates.football_division_hid = his.football_division_hid and dates.match_date > his.match_date and team_division.football_team_hid = his.football_team_home_hid join analytical_layer.football_match_his_l_s_statistic stat on his.football_match_his_lid = stat.football_match_his_lid ) where rang_team = 30 ) ) home_str on ( fixtures.football_division_hid = home_str.football_division_hid and fixtures.football_season_hid = home_str.football_season_hid and fixtures.match_date = home_str.match_date and fixtures.football_team_home_hid = home_str.football_team_home_hid ) --away strength weakness join ( select football_division_hid, football_season_hid, match_date, football_team_away_hid, avg_team_goals_for, avg_team_goals_against, avg_league_goals_for, avg_league_goals_against, round(avg_team_goals_for / avg_league_goals_for, 2) away_attacking_strength, round(avg_team_goals_against / avg_league_goals_against, 2) away_defence_strength from ( select distinct football_division_hid, football_season_hid, match_date, football_team_away_hid, avg(goals_for) over (partition by football_team_away_hid) avg_team_goals_for, avg(goals_against) over (partition by football_team_away_hid) avg_team_goals_against, avg(goals_for) over (partition by 1) avg_league_goals_for, avg(goals_against) over (partition by 1) avg_league_goals_against from ( select dates.football_division_hid, dates.football_season_hid, dates.match_date, his.football_team_away_hid, stat.full_time_home_goals goals_for, stat.full_time_away_goals goals_against, dense_rank() over (partition by his.football_team_away_hid order by his.match_date desc) rang_team from ( select distinct football_division_hid, football_season_hid, match_date from analytical_layer.football_match_his_l ) dates join analytical layer.football_team_division_l team_division on dates.football_season_hid = team_division.football_season_hid and dates.football_division_hid = team_division.football_division_hid join analytical layer.football_match_his_l his on dates.football_division_hid = his.football_division_hid and dates.match_date > his.match_date and team_division.football_team_hid = his.football_team_away_hid join analytical_layer.football_match_his_l_s_statistic stat on his.football_match_his_lid = stat.football_match_his_lid ) where rang_team <= 30 ) ) away_str on ( fixtures.football_division_hid = away_str.football_division_hid and fixtures.football_season_hid = away_str.football_season_hid and fixtures.match_date = away_str.match_date and fixtures.football_team_away_hid = away_str.football_team_away_hid )
The base statement is doubled (lines 23-81 & lines 91-149) to query the attack and defence strength for the home and the away team. The inner select of the base statement is extended with a subselect (lines 61-67), which determines each match date for each season and division. By doing that, I create a list of all home or away games, which happened before a specific match date.
I already mentioned, that you have to use hash keys instead of meaningful names (e.g. line 55), so that the satellite table can be integrated into the existing Data Vault model. The satellite table should be connected to the historic match link table. That’s why I have to join the result of the both sub selects for the home and away team with the original link table (line 19). So I am able to determine the link hash key for each historic match (line 4).
After all these steps, we have an extended Data Vault model, which provides the defence and attack strength at any point in history. This is the perfect starting point to develop and test the first predictive model as I am able to simulate the model over several seasons for several divisions. This will be part of the next post. There I will describe, how you can use these variables to predict football scores with help of the Poisson distribution. I will also explain the disadvantages of the Poisson distribution and how I resolved them.
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.
References:
[1] Mike J. Maher. Modelling association football scores. Statistica Neerlandica, 36(3):109–118, 1982
[2] Mark J. Dixon and Stuart G. Coles. Modeling association football scores and inefficiencies in the football betting market. Applied Statistics, 46:265–280, 1997.
7 Replies to “Define variables: attack & defence strength”