What does my typical betting weekend look like when I start checking whether there are any interesting matches? I start my laptop, open the browser, start my Python program, start the database and, after a few minutes, I can start my data processing, which collects all the data and calculates the predictions. That’s already great, but wouldn’t it be even better to have all predictions always up to date? This post shows you how to set up and run a small automated data pipeline in AWS that extracts all stats from Understat.com.

Continue reading “Automate your betting models with AWS”

## Why every data scientist should learn SQL

It’s been quite a long time since the last post on my blog, and there is a specific reason for that: I participated in the 2nd DFB Hackathon, which consumed a huge amount of the free time I normally spend creating content for this blog. The Hackathon was again a great experience, as all this deep data science stuff is still a challenge for me. But it again left me with one big question: why do data scientists often use just Python (or R) and not know how and when to use SQL?

Continue reading “Why every data scientist should learn SQL”

## xG data journey – scrapping dynamic webpages

In the first part of this data journey, I took a look at the general definition of expected goals (xG) and the usage of this metric. The next step in testing the predictive power of xG is to get some data. This part focuses on getting the team expected goals statistics. In one of the following parts, I will also look at getting the player expected goals statistics, as these offer even deeper insights.

Continue reading “xG data journey – scrapping dynamic webpages”

## Define Variables: GS & PPG match rating

In the last post I described the predictive models that will be explained in this series. Following the development process for predictive models, the next step is to supply the raw data for them. Fortunately, football-data.co.uk already offers all the data these models need. This post explains how to implement the features for the GS and PPG match rating models based on the existing Raw Data Vault model.

## Define variables: Brier score for market odds

While browsing the internet and looking for new inspiration to build my own predictive model, I came upon a very interesting possible feature: the Brier score.

The Brier score is a way to measure the accuracy of probabilistic predictions. It is often used to measure the accuracy of weather forecasts. At first I thought I could use it as a kind of calibration feature, so that a predictive model recognizes when it was too inaccurate in the past. But using it as a feature to detect teams that are predicted well by the bookies, or that tend to cause unexpected results, seems to be the more promising approach. Therefore, in this post I want to explain how to calculate the Brier score based on the latest betting odds for a specific team.
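As a rough illustration of the idea, here is a minimal Python sketch of a Brier score computed from three-way market odds. It assumes decimal odds for home win, draw and away win, removes the bookmaker margin by simple normalisation, and uses the multi-outcome form of the Brier score (the sum of squared differences between predicted probability and actual result). The function names and the normalisation method are assumptions made for this example and are not taken from the full post.

```python
def implied_probabilities(odds):
    """Convert decimal odds (home, draw, away) into probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by simple normalisation.
    """
    inverse = [1.0 / o for o in odds]
    total = sum(inverse)  # usually above 1 because of the bookmaker margin
    return [p / total for p in inverse]


def brier_score(probabilities, outcome_index):
    """Multi-outcome Brier score for one match.

    outcome_index: 0 = home win, 1 = draw, 2 = away win.
    0 is a perfect prediction; 2 is the worst possible value.
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probabilities)
    )


def mean_brier_score(matches):
    """Average Brier score over a team's matches.

    matches: list of (odds, outcome_index) tuples (hypothetical input format).
    """
    scores = [brier_score(implied_probabilities(odds), outcome)
              for odds, outcome in matches]
    return sum(scores) / len(scores)
```

Averaging the score over the last few matches of a team gives a single number per team: the lower it is, the better the market has predicted that team's recent results.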

Continue reading “Define variables: Brier score for market odds”

## Implement Model: Poisson distribution

In the last post, the prototype of the Poisson prediction model showed that the optimised model is able to beat the bookie, at least for the German Bundesliga. The next step in the predictive model development process is to implement the model for forecasting the current fixtures. For this model that part is very easy, as there is no trained model to deploy, just the prediction logic.
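To give an impression of what such prediction logic can look like, the following Python sketch turns two expected-goal values into home win, draw and away win probabilities. It assumes that home and away goals are independent and Poisson distributed and cuts the score grid off at a maximum number of goals. The function names and the `max_goals` parameter are hypothetical, and the sketch is not the implementation from the full post.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(home_exp, away_exp, max_goals=10):
    """Sum the score grid into (home win, draw, away win) probabilities.

    Assumption: home and away goals are independent Poisson variables.
    Scores above max_goals are ignored, so the three values sum to
    slightly less than 1.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_exp) * poisson_pmf(a, away_exp)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win
```

Comparing these probabilities with the probabilities implied by the market odds is then enough to spot fixtures where the model and the bookie disagree.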

## Define variables: attack & defence strength

During my first investigations into predicting football scores, I came across the predictive models of Maher [1] and Dixon / Coles [2]. Maher modelled the number of goals scored by the home team and by the away team in a match as two independent Poisson-distributed variables. He assumed that each team has an attacking strength and a defence strength. Dixon / Coles extended this model by correcting some shortcomings of the Poisson distribution and by using a time-dependent attack and defence strength. Both papers are the basis of my first predictive model.

In this post I want to describe how the attack and defence strength are calculated and how you add this calculation to the existing Data Vault model. The predictive model itself will be explained in another post.
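To make the idea of attack and defence strength more concrete, here is a small Python sketch of one common simplification: a team's attack strength is its average goals scored relative to the league average, and its defence strength is its average goals conceded relative to the league average. This is an assumption chosen for illustration. It ignores home advantage and time weighting, it is not the maximum-likelihood estimation used by Maher or Dixon / Coles, and the function names and the input format are hypothetical.

```python
from collections import defaultdict


def team_strengths(matches):
    """Compute (attack, defence) strength per team.

    matches: list of (home_team, away_team, home_goals, away_goals).
    Assumes at least one goal was scored in the data set.
    """
    scored = defaultdict(int)
    conceded = defaultdict(int)
    played = defaultdict(int)
    for home, away, home_goals, away_goals in matches:
        scored[home] += home_goals
        conceded[home] += away_goals
        scored[away] += away_goals
        conceded[away] += home_goals
        played[home] += 1
        played[away] += 1

    # Average goals scored per team per match across the whole league
    league_avg = sum(scored.values()) / sum(played.values())

    strengths = {}
    for team in played:
        attack = scored[team] / played[team] / league_avg
        defence = conceded[team] / played[team] / league_avg
        strengths[team] = (attack, defence)
    return strengths, league_avg


def expected_goals(strengths, league_avg, team, opponent):
    """Expected goals of `team` against `opponent` (no home advantage)."""
    return strengths[team][0] * strengths[opponent][1] * league_avg
```

An attack strength above 1 means the team scores more than the league average, while a defence strength above 1 means it concedes more than average. The resulting expected goals can serve as the Poisson means for the home and the away team.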

Continue reading “Define variables: attack & defence strength”