In the first part of this data journey, I took a look at the general definition of expected goals (xG) and the usage of this metric. The next step in testing the predictive power of xG is to get some data. This part focuses on getting the team expected goals statistics. In one of the following parts, I will also look at getting the player expected goals statistics, as this of course offers even deeper insights.
When a rich club in Germany goes through a bad performance phase or loses an important match, we like to use the phrase “Geld schießt eben keine Tore”, which means, more or less, that big money doesn’t guarantee goals. But the general expectation is, of course, that richer clubs win more often, as they have the money to buy the best players. This inspired me to start a data journey about market values in the big 5 European leagues: What do the market values tell us about the development of the different leagues? How do teams perform in relation to the money they spend? Does the market value of a team have predictive significance?
The hardest part of sports analytics is getting data! It is not for nothing that there are companies that earn their money solely with sports data. But if you are not able or not willing to pay such amounts of money, you have just one option: scraping the data from the web. In an older post, I described an R web scraper. As that one was no longer working, I needed a new one, which brings us to this post. This time I will describe how to create a web scraper for static HTML sites with Python and how to implement such a scraper as a User Defined Function (UDF) in Exasol.
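The basic pattern behind a scraper for static HTML can be sketched with nothing but the Python standard library (a production scraper would more likely use `requests` and `BeautifulSoup`; the URL and the link-extraction target below are only illustrative assumptions): download the page once, then walk the HTML for the elements you need.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Download a static HTML page (no JavaScript rendering involved)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


class LinkExtractor(HTMLParser):
    """Collect all href attributes from <a> tags as a minimal example target."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# The parsing step works the same on any HTML string, so it can be
# demonstrated without hitting a real website:
sample = '<p><a href="/match/1">Match 1</a> <a href="/match/2">Match 2</a></p>'
parser = LinkExtractor()
parser.feed(sample)
print(parser.links)  # ['/match/1', '/match/2']
```

Inside an Exasol UDF the same parsing logic would run per row, with the fetched HTML emitted as result columns; the exact UDF scaffolding depends on the Exasol version in use.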
Squawka changed the design of their website. The website no longer uses HTML tables to list the fixtures for a specific league and also changed the corresponding URLs. That’s why the described web scraper no longer works. As soon as I find another data source for the upcoming fixtures, I will write a new blog post.
For the implementation of the Poisson prediction model, I needed a data source for the current fixtures. As a temporary solution I used a manual CSV file, which I updated and imported regularly. During my research for new data sources, I found the website squawka.com. This website provides statistics and analysis based on Opta data. In this post, I will describe how you can extract the current fixtures from this website and use them in the normal data processing, replacing the manual CSV interface.
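Since the fixtures were listed in plain HTML tables at the time, the extraction step can be sketched as turning such a table into the same records the manual CSV used to provide. The three-column layout (date, home team, away team) and the sample fixtures below are assumptions for illustration; the real Squawka markup differed and has since changed.

```python
from html.parser import HTMLParser


class FixtureTableParser(HTMLParser):
    """Collect the cell texts of each <tr> in a static HTML fixtures table.

    Assumes a simple three-column layout: date, home team, away team.
    """

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (date, home, away) tuples
        self._row = None      # cells of the <tr> currently being read
        self._in_cell = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._text = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row.append("".join(self._text).strip())
        elif tag == "tr" and self._row:
            self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)


# Invented sample markup standing in for the (now defunct) fixtures table:
html_table = """
<table>
  <tr><td>2018-08-24</td><td>FC Bayern</td><td>TSG Hoffenheim</td></tr>
  <tr><td>2018-08-25</td><td>Fortuna Duesseldorf</td><td>FC Augsburg</td></tr>
</table>
"""
parser = FixtureTableParser()
parser.feed(html_table)
for date, home, away in parser.rows:
    print(f"{date}: {home} vs. {away}")
```

The resulting tuples can then be loaded into the database in place of the manual CSV import, so the rest of the data processing stays unchanged.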