# Define variables: attack & defence strength

During my first investigations for predicting football scores I came across the predictive models of Maher [1] and Dixon / Coles [2]. Maher modelled the number of goals a team scores during a match as two independent Poisson distributed variables, for the home team and the away team. He assumed that each team has an attacking strength and a defence strength. Dixon / Coles extended this model by adjusting some disadvantages of the Poisson distribution and by using a time dependent attack and defence strength. Both papers are the base of my first predictive model.

In this Post I want to describe, how the attack and defence strength are calculated and how you add this calculation to the existing Data Vault model. The predictive model itself will be explained in another post.

## Attack & defence strength

Following probability distribution function was used by Maher:

$P( x_{ij} = x, Y_{ij} = y | \alpha, \beta, k) = Poisson(x | k * \alpha_{i} \beta_{j}) * Poisson(x | \alpha_{j} \beta_{i})$

The team has an attack strength $\alpha_{i}$ and defence strength $\beta_{i}$. A correction factor for the home advantage is represented by the parameter k. The more goals a team scores in comparison to the league average, the higher is the attack strength. For the defence strength it is inverse. The better a defence is or the fewer goals a team concedes, the smaller is the defence strength.

For my predictive model, I did not use a home field advantage factor. Instead, I calculated the attack and defence strength separate for home and away teams:

$HomeAttackStrength = AvgHomeGoalsScored(X, team_{i}) / AvgHomeGoalsScored(X)$

$HomeDefenceStrength = AvgHomeGoalsConceded(X, team_{i}) / AvgHomeGoalsConceded(X)$

$AwayAttackStrength = AvgAwayGoalsScored(X, team_{j}) / AvgAwayGoalsScored(X)$

$AwayDefenceStrength = AvgAwayGoalsConceded(X, team_{i}) / AvgAwayGoalsConceded(X)$

The parameter X defines the number of games over which the average should be determined. So the home attack strength is the ratio of the averaged scored goals of team i in the last X games to the average scored goals of the complete league in the last X games. So X specifies the time dependency. In comparison to Dixon / Coles I do not use a weighting method to give younger games a higher weight. This is definitely an optimization opportunity.

## SQL implementation

As I now know, what is the definition of my variables, I can start writing a SQL statement, which calculates these variables. In a first step, I will explain, how the attack and defence strength for a home team is queried. After that I show you, how you calculate the variables for each historic match.

For calculating different features I need to know, which team played in which division in a specific season. Because this information is needed often, I decided to implement this relation as an additional link table. The link table is basically just an aggregation of the football match link table.

After the link table was created, I can use e.g. following statement to determine the current home attack and home defence strength for each team of the Bundesliga based on the last 30 games:

select
home_team,
avg_team_goals_for,
avg_team_goals_against,
avg_league_goals_for,
avg_league_goals_against,
round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength,
round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength
from
(
select distinct
home_team,
avg(goals_for) over (partition by home_team) avg_team_goals_for,
avg(goals_against) over (partition by home_team) avg_team_goals_against,
avg(goals_for) over (partition by 1) avg_league_goals_for,
avg(goals_against) over (partition by 1) avg_league_goals_against
from
(
select
his.match_date his_match_date,
home_team.team home_team,
stat.full_time_home_goals goals_for,
stat.full_time_away_goals goals_against,
dense_rank() over (partition by home_team.team order by his.match_date desc) rang_team
from
analytical layer.football_team_division_l team_division
join analytical layer.football_match_his_l his
on team_division.football_team_hid = his.football_team_home_hid and
team_division.football_division_hid = his.football_division_hid
join analytical layer.football_division_h division
on team_division.football_division_hid = division.football_division_hid
join analytical layer.football_team_h home_team
on team_division.football_team_hid = home_team.football_team_hid
join analytical layer.football_season_h season
on team_division.football_season_hid = season.football_season_hid
join analytical layer.football_match_his_l_s_statistic stat
on his.football_match_his_lid = stat.football_match_his_lid
where
division.division = 'D1' and
season.season = '2016_2017'
)
where
rang_team <= 30
)


The inner select (lines 19-40) is used to determine the past home game results of all current teams of the Bundesliga. The analytical function DENSE_RANK calculates the rank over the number of home matches, so that I am able to filter the specific number of matches in the past, which I want to use for the average calculation. In this example I filter for the last 30 home games.

The next both parts are easy. Based on the filtered number of home matches for each team, the statement calculates first the average number of scored and conceded goals for the team and the whole league (lines 11-16). After that, the attack and defence strength are calculated (lines 1-8).

Calculating the away attack and defence strength looks the same except, that you have to use the away team to determine all away matches.

## Extend Data Vault model

Unfortunately being able to calculate the current attack and defence strength is not enough. As I want to simulate predictive models in the past, I need to know the attack and defence strength at any possible date. Therefore I implemented the feature calculation as a separate satellite table. By doing this, I am able to use this features in different predictive models. I could also create several satellites with different numbers of included games.

The SQL statement of the first part is the base for the new satellite table. To integrate such a statement as a new table, you have to use the hash-keys instead of names for the team, season and division. Following statement creates a satellite table, which uses the last 30 games to calculate the attack and defence strength:

create or replace view betting_dv.football_match_his_l_s_attack_defence_strength
as
select
fixtures.football_match_his_lid,
round(home_str.avg_team_goals_for,2) avg_home_team_goals_for,
round(home_str.avg_team_goals_against,2) avg_home_team_goals_against,
round(away_str.avg_team_goals_for,2)	avg_away_team_goals_for,
round(away_str.avg_team_goals_against,2) avg_away_team_goals_against,
round(home_str.avg_league_goals_for,2) avg_league_home_goals_for,
round(home_str.avg_league_goals_against,2) avg_league_home_goals_against,
round(away_str.avg_league_goals_for,2) avg_league_away_goals_for,
round(away_str.avg_league_goals_against,2) avg_league_away_goals_against,
round(home_str.home_attacking_strength,2) home_attacking_strength,
round(home_str.home_defence_strength,2) home_defence_strength,
round(away_str.away_attacking_strength,2) away_attacking_strength,
round(away_str.away_defence_strength,2) away_defence_strength
from
--match combinations
betting_dv.football_match_his_l fixtures
--home strength weakness
join
(
select
football_division_hid,
football_season_hid,
match_date,
football_team_home_hid,
avg_team_goals_for,
avg_team_goals_against,
avg_league_goals_for,
avg_league_goals_against,
round(avg_team_goals_for / avg_league_goals_for, 2) home_attacking_strength,
round(avg_team_goals_against / avg_league_goals_against, 2) home_defence_strength
from
(
select distinct
football_division_hid,
football_season_hid,
match_date,
football_team_home_hid,
avg(goals_for) over (partition by football_team_home_hid) avg_team_goals_for,
avg(goals_against) over (partition by football_team_home_hid) avg_team_goals_against,
avg(goals_for) over (partition by 1) avg_league_goals_for,
avg(goals_against) over (partition by 1) avg_league_goals_against
from
(
select
dates.football_division_hid,
dates.football_season_hid,
dates.match_date,
his.football_team_home_hid,
stat.full_time_home_goals goals_for,
stat.full_time_away_goals goals_against,
dense_rank() over (partition by his.football_team_home_hid order by his.match_date desc) rang_team
from
(
select distinct
football_division_hid,
football_season_hid,
match_date
from
analytical_layer.football_match_his_l
) dates
join analytical layer.football_team_division_l team_division
on dates.football_season_hid = team_division.football_season_hid and
dates.football_division_hid = team_division.football_division_hid
join analytical layer.football_match_his_l his
on 	dates.football_division_hid = his.football_division_hid and
dates.match_date &gt; his.match_date and
team_division.football_team_hid = his.football_team_home_hid
join analytical_layer.football_match_his_l_s_statistic stat
on his.football_match_his_lid = stat.football_match_his_lid
)
where
rang_team = 30
)
) home_str
on (
fixtures.football_division_hid 	= home_str.football_division_hid and
fixtures.football_season_hid 	= home_str.football_season_hid and
fixtures.match_date				= home_str.match_date and
fixtures.football_team_home_hid = home_str.football_team_home_hid
)
--away strength weakness
join
(
select
football_division_hid,
football_season_hid,
match_date,
football_team_away_hid,
avg_team_goals_for,
avg_team_goals_against,
avg_league_goals_for,
avg_league_goals_against,
round(avg_team_goals_for / avg_league_goals_for, 2) away_attacking_strength,
round(avg_team_goals_against / avg_league_goals_against, 2) away_defence_strength
from
(
select distinct
football_division_hid,
football_season_hid,
match_date,
football_team_away_hid,
avg(goals_for) over (partition by football_team_away_hid) avg_team_goals_for,
avg(goals_against) over (partition by football_team_away_hid) avg_team_goals_against,
avg(goals_for) over (partition by 1) avg_league_goals_for,
avg(goals_against) over (partition by 1) avg_league_goals_against
from
(
select
dates.football_division_hid,
dates.football_season_hid,
dates.match_date,
his.football_team_away_hid,
stat.full_time_home_goals goals_for,
stat.full_time_away_goals goals_against,
dense_rank() over (partition by his.football_team_away_hid order by his.match_date desc) rang_team
from
(
select distinct
football_division_hid,
football_season_hid,
match_date
from
analytical_layer.football_match_his_l
) dates
join analytical layer.football_team_division_l team_division
on dates.football_season_hid = team_division.football_season_hid and
dates.football_division_hid = team_division.football_division_hid
join analytical layer.football_match_his_l his
on 	dates.football_division_hid = his.football_division_hid and
dates.match_date &gt; his.match_date and
team_division.football_team_hid = his.football_team_away_hid
join analytical_layer.football_match_his_l_s_statistic stat
on his.football_match_his_lid = stat.football_match_his_lid
)
where
rang_team &lt;= 30
)
) away_str
on (
fixtures.football_division_hid 	= away_str.football_division_hid and
fixtures.football_season_hid 	= away_str.football_season_hid and
fixtures.match_date				= away_str.match_date and
fixtures.football_team_away_hid = away_str.football_team_away_hid
)


The base statement is doubled (lines 23-81 & lines 91-149) to query the attack and defence strength for the home and the away team. The inner select of the base statement is extended with a subselect (lines 61-67), which determines each match date for each season and division. By doing that, I create a list of all home or away games, which happened before a specific match date.

I already mentioned, that you have to use hash keys instead of meaningful names (e.g. line 55), so that the satellite table can be integrated into the existing Data Vault model. The satellite table should be connected to the historic match link table. That’s why I have to join the result of the both sub selects for the home and away team with the original link table (line 19). So I am able to determine  the link hash key for each historic match (line 4).

After all these steps, we have an extended Data Vault model, which provides the defence and attack strength at any point in history. This is the perfect starting point to develop and test the first predictive model as I am able to simulate the model over several seasons for several divisions. This will be part of the next post. There I will describe, how you can use these variables to predict football scores with help of the Poisson distribution. I will also explain the disadvantages of the Poisson distribution and how I resolved them.