Define variables: attack & defence strength

During my first investigations for predicting football scores I came across the predictive models of Maher [1] and Dixon / Coles [2]. Maher modelled the number of goals a team scores during a match as two independent Poisson distributed variables, for the home team and the away team. He assumed that each team has an attacking strength and a defence strength. Dixon / Coles extended this model by adjusting some disadvantages of the Poisson distribution and by using a time dependent attack and defence strength. Both papers are the base of my first predictive model.

In this Post I want to describe, how the attack and defence strength are calculated and how you add this calculation to the existing Data Vault model. The predictive model itself will be explained in another post.

Attack & defence strength

Following probability distribution function was used by Maher:

 

P( x_{ij}  = x, Y_{ij} = y | \alpha,  \beta, k) = Poisson(x | k * \alpha_{i} \beta_{j}) * Poisson(x | \alpha_{j} \beta_{i})

 

The team has an attack strength \alpha_{i}  and defence strength \beta_{i} . A correction factor for the home advantage is represented by the parameter k. The more goals a team scores in comparison to the league average, the higher is the attack strength. For the defence strength it is inverse. The better a defence is or the fewer goals a team concedes, the smaller is the defence strength.

For my predictive model, I did not use a home field advantage factor. Instead, I calculated the attack and defence strength separate for home and away teams:

 

HomeAttackStrength = AvgHomeGoalsScored(X, team_{i}) / AvgHomeGoalsScored(X)

HomeDefenceStrength = AvgHomeGoalsConceded(X, team_{i}) / AvgHomeGoalsConceded(X)

AwayAttackStrength = AvgAwayGoalsScored(X, team_{j}) / AvgAwayGoalsScored(X)

AwayDefenceStrength = AvgAwayGoalsConceded(X, team_{i}) / AvgAwayGoalsConceded(X)

 

The parameter X defines the number of games over which the average should be determined. So the home attack strength is the ratio of the averaged scored goals of team i in the last X games to the average scored goals of the complete league in the last X games. So X specifies the time dependency. In comparison to Dixon / Coles I do not use a weighting method to give younger games a higher weight. This is definitely an optimization opportunity.

 

SQL implementation

As I now know, what is the definition of my variables, I can start writing a SQL statement, which calculates these variables. In a first step, I will explain, how the attack and defence strength for a home team is queried. After that I show you, how you calculate the variables for each historic match.

For calculating different features I need to know, which team played in which division in a specific season. Because this information is needed often, I decided to implement this relation as an additional link table. The link table is basically just an aggregation of the football match link table.

After the link table was created, I can use e.g. following statement to determine the current home attack and home defence strength for each team of the Bundesliga based on the last 30 games:

select
	home_team,
	avg_team_goals_for,
	avg_team_goals_against,
	avg_league_goals_for,
	avg_league_goals_against,
	round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength,
	round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength
from
	(
	select distinct
		home_team,
		avg(goals_for) over (partition by home_team) avg_team_goals_for,
		avg(goals_against) over (partition by home_team) avg_team_goals_against,
		avg(goals_for) over (partition by 1) avg_league_goals_for,
		avg(goals_against) over (partition by 1) avg_league_goals_against
	from
		(
		select
			his.match_date his_match_date,
			home_team.team home_team,
			stat.full_time_home_goals goals_for,
			stat.full_time_away_goals goals_against,
			dense_rank() over (partition by home_team.team order by his.match_date desc) rang_team
		from
			analytical layer.football_team_division_l team_division
			join analytical layer.football_match_his_l his
				on team_division.football_team_hid = his.football_team_home_hid and
				team_division.football_division_hid = his.football_division_hid
			join analytical layer.football_division_h division
				on team_division.football_division_hid = division.football_division_hid
			join analytical layer.football_team_h home_team
				on team_division.football_team_hid = home_team.football_team_hid
			join analytical layer.football_season_h season
				on team_division.football_season_hid = season.football_season_hid
			join analytical layer.football_match_his_l_s_statistic stat
				on his.football_match_his_lid = stat.football_match_his_lid
		where
			division.division = 'D1' and
			season.season = '2016_2017'
		)
	where
		rang_team <= 30
	)

The inner select (lines 19-40) is used to determine the past home game results of all current teams of the Bundesliga. The analytical function DENSE_RANK calculates the rank over the number of home matches, so that I am able to filter the specific number of matches in the past, which I want to use for the average calculation. In this example I filter for the last 30 home games.

The next both parts are easy. Based on the filtered number of home matches for each team, the statement calculates first the average number of scored and conceded goals for the team and the whole league (lines 11-16). After that, the attack and defence strength are calculated (lines 1-8).

Calculating the away attack and defence strength looks the same except, that you have to use the away team to determine all away matches.

 

Extend Data Vault model

Unfortunately being able to calculate the current attack and defence strength is not enough. As I want to simulate predictive models in the past, I need to know the attack and defence strength at any possible date. Therefore I implemented the feature calculation as a separate satellite table. By doing this, I am able to use this features in different predictive models. I could also create several satellites with different numbers of included games.

 

Analytical DV wo current
Extended Data Vault model with football team link and attack & defence strength satellite

 

The SQL statement of the first part is the base for the new satellite table. To integrate such a statement as a new table, you have to use the hash-keys instead of names for the team, season and division. Following statement creates a satellite table, which uses the last 30 games to calculate the attack and defence strength:

create or replace view betting_dv.football_match_his_l_s_attack_defence_strength
as
select
	fixtures.football_match_his_lid,
	round(home_str.avg_team_goals_for,2) avg_home_team_goals_for,
	round(home_str.avg_team_goals_against,2) avg_home_team_goals_against,
	round(away_str.avg_team_goals_for,2)	avg_away_team_goals_for,
	round(away_str.avg_team_goals_against,2) avg_away_team_goals_against,
	round(home_str.avg_league_goals_for,2) avg_league_home_goals_for,
	round(home_str.avg_league_goals_against,2) avg_league_home_goals_against,
	round(away_str.avg_league_goals_for,2) avg_league_away_goals_for,
	round(away_str.avg_league_goals_against,2) avg_league_away_goals_against,
	round(home_str.home_attacking_strength,2) home_attacking_strength,
	round(home_str.home_defence_strength,2) home_defence_strength,
	round(away_str.away_attacking_strength,2) away_attacking_strength,
	round(away_str.away_defence_strength,2) away_defence_strength
from
	--match combinations
	betting_dv.football_match_his_l fixtures
	--home strength weakness
	join
		(
		select
			football_division_hid,
			football_season_hid,
			match_date,
			football_team_home_hid,
			avg_team_goals_for,
			avg_team_goals_against,
			avg_league_goals_for,
			avg_league_goals_against,
			round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength,
			round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength
		from
			(
			select distinct
				football_division_hid,
				football_season_hid,
				match_date,
				football_team_home_hid,
				avg(goals_for) over (partition by football_team_home_hid) avg_team_goals_for,
				avg(goals_against) over (partition by football_team_home_hid) avg_team_goals_against,
				avg(goals_for) over (partition by 1) avg_league_goals_for,
				avg(goals_against) over (partition by 1) avg_league_goals_against
			from
				(
				select
					dates.football_division_hid,
					dates.football_season_hid,
					dates.match_date,
					his.football_team_home_hid,
					stat.full_time_home_goals goals_for,
					stat.full_time_away_goals goals_against,
					dense_rank() over (partition by his.football_team_home_hid order by his.match_date desc) rang_team
				from
					(
					select distinct
						football_division_hid,
						football_season_hid,
						match_date
					from
						analytical_layer.football_match_his_l
					) dates
					join analytical layer.football_team_division_l team_division
						on dates.football_season_hid = team_division.football_season_hid and
							dates.football_division_hid = team_division.football_division_hid
					join analytical layer.football_match_his_l his
						on 	dates.football_division_hid = his.football_division_hid and
							dates.match_date &gt; his.match_date and
							team_division.football_team_hid = his.football_team_home_hid
					join analytical_layer.football_match_his_l_s_statistic stat
						on his.football_match_his_lid = stat.football_match_his_lid
				)
			where
				rang_team = 30
			)
		) home_str
	on (
		fixtures.football_division_hid 	= home_str.football_division_hid and
		fixtures.football_season_hid 	= home_str.football_season_hid and
		fixtures.match_date				= home_str.match_date and
		fixtures.football_team_home_hid = home_str.football_team_home_hid
		)
	--away strength weakness
	join
		(
		select
			football_division_hid,
			football_season_hid,
			match_date,
			football_team_away_hid,
			avg_team_goals_for,
			avg_team_goals_against,
			avg_league_goals_for,
			avg_league_goals_against,
			round(avg_team_goals_for / avg_league_goals_for, 2) away_attacking_strength,
			round(avg_team_goals_against / avg_league_goals_against, 2) away_defence_strength
		from
			(
			select distinct
				football_division_hid,
				football_season_hid,
				match_date,
				football_team_away_hid,
				avg(goals_for) over (partition by football_team_away_hid) avg_team_goals_for,
				avg(goals_against) over (partition by football_team_away_hid) avg_team_goals_against,
				avg(goals_for) over (partition by 1) avg_league_goals_for,
				avg(goals_against) over (partition by 1) avg_league_goals_against
			from
				(
				select
					dates.football_division_hid,
					dates.football_season_hid,
					dates.match_date,
					his.football_team_away_hid,
					stat.full_time_home_goals goals_for,
					stat.full_time_away_goals goals_against,
					dense_rank() over (partition by his.football_team_away_hid order by his.match_date desc) rang_team
				from
					(
					select distinct
						football_division_hid,
						football_season_hid,
						match_date
					from
						analytical_layer.football_match_his_l
					) dates
					join analytical layer.football_team_division_l team_division
						on dates.football_season_hid = team_division.football_season_hid and
							dates.football_division_hid = team_division.football_division_hid
					join analytical layer.football_match_his_l his
						on 	dates.football_division_hid = his.football_division_hid and
							dates.match_date &gt; his.match_date and
							team_division.football_team_hid = his.football_team_away_hid
					join analytical_layer.football_match_his_l_s_statistic stat
						on his.football_match_his_lid = stat.football_match_his_lid
				)
			where
				rang_team &lt;= 30
			)
		) away_str
	on (
		fixtures.football_division_hid 	= away_str.football_division_hid and
		fixtures.football_season_hid 	= away_str.football_season_hid and
		fixtures.match_date				= away_str.match_date and
		fixtures.football_team_away_hid = away_str.football_team_away_hid
		)

The base statement is doubled (lines 23-81 & lines 91-149) to query the attack and defence strength for the home and the away team. The inner select of the base statement is extended with a subselect (lines 61-67), which determines each match date for each season and division. By doing that, I create a list of all home or away games, which happened before a specific match date.

I already mentioned, that you have to use hash keys instead of meaningful names (e.g. line 55), so that the satellite table can be integrated into the existing Data Vault model. The satellite table should be connected to the historic match link table. That’s why I have to join the result of the both sub selects for the home and away team with the original link table (line 19). So I am able to determine  the link hash key for each historic match (line 4).

After all these steps, we have an extended Data Vault model, which provides the defence and attack strength at any point in history. This is the perfect starting point to develop and test the first predictive model as I am able to simulate the model over several seasons for several divisions. This will be part of the next post. There I will describe, how you can use these variables to predict football scores with help of the Poisson distribution. I will also explain the disadvantages of the Poisson distribution and how I resolved them.

 

 

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

 

References:

[1]  Mike J. Maher. Modelling association football scores. Statistica Neerlandica, 36(3):109–118, 1982

[2] Mark J. Dixon and Stuart G. Coles. Modeling association football scores and inefficiencies in the football betting market. Applied Statistics, 46:265–280, 1997.

7 Replies to “Define variables: attack & defence strength”

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: