No matter what predictive model you want to build, you have to go through several steps. You can find many different descriptions of such a development process for statistical or predictive models on the internet. I have chosen a relatively simple one, which is based on the material of a SAS training.
This development process consists of 6 steps, which are repeated iteratively for each new predictive model you want to develop. Individual steps may be shortened, as you can reuse the source data of one predictive model for another one. But in principle you have to look at every development step.
In the first step you define your objective: the problem you want to solve. At first glance this is easy: I want to predict football games. But what exactly do I want to predict? The exact result, e.g. that Team A beats Team B 3:1? Or do I just want to know whether Team A or Team B wins? You have to be sure what you want to predict. A predictive model that gives you the probability of the exact result 3:1 can look quite different from a predictive model that only forecasts the probability of a home win.
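To make the difference concrete, here is a minimal Python sketch that derives two different target variables from the same results. The function name `match_outcome` and the sample scores are invented for illustration only.

```python
def match_outcome(home_goals: int, away_goals: int) -> str:
    """Collapse an exact score into a three-way result: 'H', 'D' or 'A'."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


# Hypothetical results as (home_goals, away_goals) pairs.
results = [(3, 1), (0, 0), (1, 2), (2, 0)]

# Target for an exact-result model: one class per distinct score.
exact_score_targets = [f"{h}:{a}" for h, a in results]

# Target for a home-win model: 1 if the home team won, otherwise 0.
home_win_targets = [1 if match_outcome(h, a) == "H" else 0 for h, a in results]
```

The exact-result target has many classes with few examples each, while the home-win target is a simple yes/no label. That is one reason why the two objectives can lead to quite different models.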
Once you have defined your objective, you have to look for data to solve the problem. And I can tell you: you need a lot of data! The more data you get, the better. It is not enough to have the results of the German Bundesliga for the current season. You need many years of data, because you must train and validate a predictive model. And most important: you need historical betting odds! Otherwise it is not possible to simulate a predictive model against a bookie.
The internet is of course the best data source. During my research I found several sites that offer complete datasets of sports data. Another possibility is to extract the data from websites with a web crawler. I will publish some posts in which I explain which data sources I use and how I imported the data into my analytical system.
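One reason why several seasons are needed: the model should be validated on data it has not seen during training. A minimal Python sketch of such a split by season could look like this. The field name `season`, the function `split_by_season` and the sample rows are assumptions for illustration, not the structure of a real dataset.

```python
def split_by_season(rows, validation_seasons):
    """Split match rows into a training and a validation set by season."""
    train = [row for row in rows if row["season"] not in validation_seasons]
    valid = [row for row in rows if row["season"] in validation_seasons]
    return train, valid


# Hypothetical rows; real data would contain results and betting odds as well.
rows = [
    {"season": "2011/12", "home": "Team A", "away": "Team B"},
    {"season": "2012/13", "home": "Team B", "away": "Team A"},
    {"season": "2013/14", "home": "Team A", "away": "Team B"},
]

# Train on the older seasons, validate on the most recent one.
train_rows, validation_rows = split_by_season(rows, {"2013/14"})
```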
In the third step you need to check the quality of the data you collected. Are there any gaps? Is any data missing? Are the different data sources compatible?
Fortunately, data quality has rarely been a problem for me so far. It took some work to integrate the different data sources in the Raw Data Layer and to build the Data Vault model. With an integrated model it is easy to use data from different sources for a predictive model.
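These questions can be turned into simple automated checks. The following Python sketch counts missing values per field and looks for team names that a second source does not know. All field names, team names and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical match records; None marks a missing value.
matches = [
    {"home": "Team A", "away": "Team B", "home_goals": 3, "away_goals": 1, "home_odds": 1.80},
    {"home": "Team B", "away": "Team A", "home_goals": 0, "away_goals": 0, "home_odds": None},
    {"home": "Team A", "away": "Team C", "home_goals": None, "away_goals": 2, "home_odds": 2.10},
]


def missing_counts(rows):
    """Count, per field, how many rows have no value."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def unknown_teams(rows, reference_teams):
    """Return team names that do not appear in a second source's team list."""
    seen = {row["home"] for row in rows} | {row["away"] for row in rows}
    return sorted(seen - set(reference_teams))


gaps = missing_counts(matches)
mismatches = unknown_teams(matches, ["Team A", "Team B"])
```

Here `gaps` shows which fields are incomplete, and `mismatches` lists the names that would have to be mapped before two sources can be combined.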
Now you have to start defining the variables (features) for your predictive model. These features are the input of your predictive model. There is a simple rule: if you put garbage in, you will only get garbage out!
From my point of view, this is the most important step. A good predictive model requires a good feature selection. During the first months I learned a lot about which features are possible and which ones correlate with my objective. You have to build up some in-depth knowledge about the specific sport and its characteristics. For example, what we all learned from the 2012 Champions League final between Bayern Munich and Chelsea: possession does not win you games. So possession is not a good feature for a model that predicts football results.
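A simple first check of whether a feature correlates with the objective is the Pearson correlation coefficient. The Python sketch below computes it by hand. The feature values are invented purely to show the mechanics and are no evidence for or against any real feature.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented per-match values for one team (illustration only).
possession_share = [0.62, 0.48, 0.55, 0.70, 0.45]
shots_on_target = [4, 6, 3, 5, 7]
goal_difference = [0, 2, -1, 1, 2]  # the objective in this sketch

corr_possession = pearson(possession_share, goal_difference)
corr_shots = pearson(shots_on_target, goal_difference)
```

A value close to 0 suggests that a feature carries little linear information about the objective, while values near 1 or -1 suggest a strong linear relationship. A correlation alone does not prove that a feature is useful, but it helps to sort out weak candidates early.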
Process & Validate Model
Now the fun part of the development process starts. All data is available, and the features are defined and calculated. So you begin to build the model, test it and simulate it against the bookie's odds.
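A simulation against the bookie's odds can be sketched in a few lines of Python. The betting rule used here (bet a flat stake whenever the model probability multiplied by the decimal odds is greater than 1) is an assumption chosen for illustration, and the sample bets are made up.

```python
def simulate_flat_stake(bets, stake=1.0):
    """Simulate flat-stake betting.

    bets: list of (model_probability, decimal_odds, won) tuples.
    Returns the number of bets placed and the total profit.
    """
    placed = 0
    profit = 0.0
    for model_prob, odds, won in bets:
        # Assumed rule: only bet if the model sees a positive expected value.
        if model_prob * odds <= 1.0:
            continue
        placed += 1
        profit += stake * (odds - 1.0) if won else -stake
    return placed, profit


# Hypothetical predictions with the historical odds and the real outcome.
sample_bets = [
    (0.60, 2.00, True),   # bet placed and won
    (0.40, 2.00, True),   # skipped: no positive expected value
    (0.55, 2.00, False),  # bet placed and lost
    (0.30, 4.00, True),   # bet placed and won
]

placed, profit = simulate_flat_stake(sample_bets)
```

This is also the reason why historical betting odds are needed: without them, neither the selection of the bets nor the profit can be calculated.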
As I am neither a mathematician nor a statistics genius, this part is a bit of trial and error. If, for example, a linear regression model does not fit well, I try to improve it step by step: switching to a robust linear regression, or using polynomial terms of a feature. Everything should be tested. I have to say: Google is my friend! The R community is really great!
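The idea of using a polynomial term of a feature can be illustrated with a small Python sketch (the models on this blog are built in R, Python is used here only for illustration). To keep it short, the squared feature simply replaces the raw feature in a simple linear regression. The data is constructed so that the objective depends on the square of the feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Constructed data: y = 1 + x^2, so a straight line in x cannot fit it well.
xs = [0, 1, 2, 3, 4]
ys = [1, 2, 5, 10, 17]

# Model 1: linear regression on the raw feature.
raw_fit = fit_line(xs, ys)
raw_error = sum_squared_errors(xs, ys, *raw_fit)

# Model 2: linear regression on the squared feature.
xs_squared = [x ** 2 for x in xs]
squared_fit = fit_line(xs_squared, ys)
squared_error = sum_squared_errors(xs_squared, ys, *squared_fit)
```

Comparing `raw_error` with `squared_error` shows which variant fits the training data better. The final decision should of course be made on validation data and not on the data the model was fitted on.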
Implement & Maintain Model
After you have found a successful model, you can add it to your automated standard process. To do so, you have to migrate the R code to a UDF for Exasol. Then you can use the predictive model inside the database and build reports on top of it, which list all the bets you should place.
While you are using the predictive model to select bets, you should always monitor how the model performs on new results. That way you are able to identify possible problems for future predictions.
In my future posts I will always refer to the different steps of this development process, so that you are able to build a model on your own. The different models which I have tested or will test are just examples. For instance, you can use the same features for different predictive models.
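One possible way to monitor a model is to track the yield (profit divided by the total stake) over the most recent bets. The following Python sketch assumes flat stakes. The window size, the function name `rolling_yield` and the profit values are invented for illustration.

```python
def rolling_yield(profits, stake=1.0, window=3):
    """Yield (profit / total stake) for each trailing window of bets."""
    yields = []
    for end in range(window, len(profits) + 1):
        recent = profits[end - window:end]
        yields.append(sum(recent) / (stake * window))
    return yields


# Hypothetical profit per bet in chronological order (stake 1.0 each).
bet_profits = [1.0, -1.0, 3.0, -1.0, -1.0, -1.0]

yields = rolling_yield(bet_profits)
```

If the rolling yield keeps falling over a longer period, this is a hint to re-check the features and the model before placing further bets.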
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.