Comparing the predictive power of different xG data providers

In the realm of sports betting, predictive analytics hinges on quality data, a challenge given the cost associated with many paid services. This article delves into the world of free xG (Expected Goals) data providers used by the BeatTheBookie services, assessing their predictive power for football betting, a critical aspect for enthusiasts who seek to enhance their strategies without breaking the bank.

What to Expect from Freely Available Data?

While free xG data sources democratize sports analytics, understanding their predictive capabilities becomes paramount for making informed betting decisions.

What is xG?

In my older xG data journey post, I already looked at the definition and usage of xG. So instead of repeating that explanation here, you can just follow the link or watch this video again, which I think explains the basic idea really well.

Data Providers

The following data providers are used for the analysis:

Understat.com:

FiveThirtyEight:

Footystats:

  • Offers xG data for Big5 and numerous minor leagues
  • Selected as a replacement for FiveThirtyEight

FBref:

  • Initially known for StatsBomb data, now utilizes Opta data for Big5 and minor leagues.
  • Encompasses additional leagues such as MLS, Brazil, Belgium, Ligue 2, and women’s football
  • Blog: Scraping FBRef xG Data With Python

Data Used for Comparison

For the purpose of this analysis, data spanning the seasons from 2018/19 to 2022/23 has been considered. It’s important to note that the current season has been excluded from the study due to the unavailability of data from FiveThirtyEight.

This analysis employs one key model for football betting predictions, the Vanilla xG Poisson model, and runs it with different Exponential Moving Averages (EMAs). The Vanilla xG Poisson model serves as the foundation, estimating each team's expected goals from historical xG data. For a detailed understanding, refer to the [blog post]. Varying the EMA window (last 5, 10, 15, 20, or 30 matches) additionally provides insights into the dynamic nature of team performance.
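To make the two building blocks more concrete, here is a minimal sketch in Python. It is not the exact implementation behind the Vanilla xG Poisson model; the function names, the smoothing factor `alpha = 2 / (span + 1)`, and the assumption that home and away goals are independent Poisson variables are illustrative choices. How the smoothed xG values are turned into the two goal expectations is left open here, because that step is described in the linked blog post.

```python
import math


def ema(values, span):
    """Exponential moving average of a team's past xG values.

    Assumption: the common smoothing factor alpha = 2 / (span + 1),
    with the oldest value used as the starting point.
    """
    if not values:
        raise ValueError("need at least one xG value")
    alpha = 2.0 / (span + 1)
    avg = values[0]
    for value in values[1:]:
        avg = alpha * value + (1 - alpha) * avg
    return avg


def poisson_pmf(k, lam):
    """Probability of exactly k goals for a goal expectation lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def match_probabilities(lam_home, lam_away, max_goals=10):
    """Home win / draw / away win probabilities.

    Assumption: both goal counts are independent Poisson variables.
    Scorelines above max_goals are cut off, so the result is renormalised.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total
```

A call like `ema(last_xg_values, span=10)` would give the smoothed xG over a team's recent matches, with a smaller `span` reacting faster to recent form.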

Average Profit

Across all cases, the Big5 leagues consistently demonstrate higher predictability compared to the Minor Leagues. This suggests that the predictive power of models is generally more robust when applied to the top European football leagues. Notably, Understat emerges as the best performer in the Big5 leagues, providing the best average profit across all models. This underscores its reliability and effectiveness as a preferred choice for predictive analytics in the major European football competitions. In contrast, when analyzing Minor Leagues, FBref and Footystats exhibit similar performance. This indicates that both platforms offer comparable predictive power for football leagues outside the top-tier divisions. While these overarching trends provide initial guidance, a more detailed examination is warranted to uncover nuances and specific patterns within each data provider’s performance across different models and divisions.
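As a rough illustration of what "average profit" can mean in this context, the following sketch computes the average profit per placed bet. The staking rules are assumptions for the example and are not taken from the analysis: a flat stake of one unit, decimal odds, and a bet only being placed when the model probability multiplied by the odds is above 1 (a so-called value bet). The function names are hypothetical.

```python
def bet_profit(odds, won, stake=1.0):
    """Profit of a single bet at decimal odds (assumption: flat stake)."""
    return stake * (odds - 1.0) if won else -stake


def average_profit(bets, stake=1.0):
    """Average profit per placed bet.

    bets: iterable of (model_probability, decimal_odds, won) tuples.
    Assumption: a bet is only placed if probability * odds > 1.
    """
    profits = [
        bet_profit(odds, won, stake)
        for prob, odds, won in bets
        if prob * odds > 1.0
    ]
    return sum(profits) / len(profits) if profits else 0.0
```

With this definition, a positive value means the selected bets made money on average, and the number of placed bets can differ between data providers because each one produces different probabilities.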

Average profit per division

The detailed examination of average profit within the Big 5 leagues aligns with the overall average profit trends. The predictability across these top European leagues remains relatively consistent. Unsurprisingly, the Premier League stands out with the highest profit, a pattern observed consistently across various predictive models. This reinforces the Premier League’s status as a league with relatively high predictability, offering fruitful opportunities for informed betting. Serie A, on the other hand, emerges as the most challenging to predict within the Big 5. This finding underscores the complexity and unpredictability inherent in the Italian football landscape, presenting unique challenges for predictive modeling.

Notably, 2. Bundesliga diverges from the general trend observed in other minor leagues. Here, Footystats outperforms FBref, showcasing a unique dynamic in predictive accuracy. Further investigation into the factors influencing this distinction may provide valuable insights. For the remaining minor leagues, FBref maintains a slightly higher average profit compared to Footystats. This suggests that, in general, FBref demonstrates solid predictive capabilities for leagues outside the footballing powerhouses of the Big 5.

Average profit per model & division

Understanding the nuances of predictive success across different divisions and models is crucial for refining football betting strategies. In Ligue 1, the highest profit is achieved using Exponential Moving Averages (EMAs) of 5 and 10. Shorter-term indicators play a pivotal role in predicting outcomes within the French top-flight, showcasing the importance of recent team performance. Bundesliga presents a distinctive pattern with profit peaks at EMA 15 and EMA 20. However, what sets it apart is that Footystats also shares this peak, indicating a specific strength in predictive accuracy for Footystats at these intervals within the Bundesliga.

The trend of optimal predictive success at EMAs 5 and 10 extends to Eredivisie, the Dutch top-tier league. This suggests a shared reliance on recent team performance indicators for successful predictions. The Championship, in contrast, exhibits relative consistency across all models, indicating that the predictive power remains steady regardless of the chosen EMA. This stability provides a reliable landscape for betting strategies in this league. 2. Bundesliga diverges from the patterns observed in other leagues. While Footystats shows a positive profit, FiveThirtyEight and FBref do not exhibit the same trend. This divergence emphasizes the importance of choosing the right data provider, as Footystats stands out as a valuable resource for predictive modeling in Germany’s second-tier league.
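Picking the most profitable EMA per division, as described above, can be sketched in a few lines. The input layout, a dictionary keyed by `(league, span)`, and the function name are assumptions for this example and do not reflect how the results of this analysis are actually stored.

```python
def best_span_per_league(avg_profit):
    """Return the EMA span with the highest average profit per league.

    avg_profit: dict mapping (league, span) -> average profit.
    """
    best = {}
    for (league, span), profit in avg_profit.items():
        if league not in best or profit > best[league][1]:
            best[league] = (span, profit)
    return {league: span for league, (span, _) in best.items()}


# Placeholder numbers, purely for illustration (not results of this analysis)
example = {
    ("League A", 5): 0.04,
    ("League A", 20): 0.01,
    ("League B", 5): -0.02,
    ("League B", 20): 0.03,
}
print(best_span_per_league(example))
```

One caveat: a span selected on the same seasons that are used to measure the profit will probably look better than it would on new data, so such a selection should ideally be checked on seasons that were held out.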

Conclusions

After an extensive exploration into the realms of xG data providers, models, and their performance across diverse football leagues, the following key conclusions emerge:

Understat asserts its dominance as the premier data provider for the Big 5 European football leagues. With a consistent display of high average profit across various models, Understat stands as the go-to choice for enthusiasts seeking accurate insights into the elite competitions of the English Premier League, La Liga, Serie A, Bundesliga, and Ligue 1.

The landscape for minor leagues is nuanced, with no single data provider emerging as a clear favorite. A strategic combination of Footystats and FBref is recommended, acknowledging the diversity of predictive power each brings to the table. This blend ensures a comprehensive approach to navigating the challenges posed by leagues outside the footballing powerhouses.

A notable finding suggests that employing a varied set of moving averages for assessing past team performance might enhance the predictive power of models. Rather than enforcing one uniform moving average across all divisions, exploring multiple moving averages tailored to the unique characteristics of each league could be a promising avenue for refining predictive models in future iterations.

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

5 Replies to “Comparing the predictive power of different xG data providers”

  1. Very interesting analysis – I’ve thought about the same issue myself but never compared the xG providers.

    A question about Footystats – do they actually provide true xG values? By this I mean values calculated from the quality of the chances in the game. The high number of leagues they provide numbers for (including extremely obscure leagues) suggests to me that they are really using a statistical model to estimate xG from other match stats, rather than basing it on the quality of chances in the game. That is still a useful measure but differentiates it from the other providers. The data they claim to provide would cost vast sums to acquire from the bigger operators.

    I’ve asked them where they get their data from but they did not reveal this.


    1. Unfortunately, I do not know how the different data providers calculate their xG values. It’s not only Footystats. It’s the same for Understat. Only for FBref, as they are using Opta data, is it certain that they use positional data to calculate their metrics. I already worked with Opta data and also talked with a football club which used this data in the past.


  2. Thanks for the reply. I use FBref for the Opta data, which I regard as “true” xG data as it’s based on positional shot information. I think this is what people usually mean by the term “xG”.

    I suspect that what FootyStats provides is a different measure – the output of a model trained using post-match stats such as number of shots/corners/possession etc. to predict the number of goals. That is why they are able to provide data on obscure leagues and divisions for which no major operator provides advanced statistical measures such as xG.

    Both measures are useful but I think distinct. It is also straightforward to calculate the type of “xG” that FootyStats provides, which negates the need for an additional data source feeding into the ML models. I calculate that measure (which I call “predicted goals” (pG) to differentiate it from xG) based on the match data from the usual sources. I haven’t had time to do a proper analysis on the relative effectiveness of each. Obviously these are highly correlated features, but at the very least I can generate artificial xG type data for leagues where no such data is recorded.

    Thanks again for your work on the xG sources.


    1. A really interesting approach to creating this kind of xG data.

      Yes, having positional data as a base should always be favoured. But doing it this way should still reduce the noise of goal data enough to be more accurate.


      1. Another worry I have with FootyStats is that I don’t know what data they trained their model on. If matches in their training set overlap with my validation/testing dataset, this represents a data leak and may result in artificially positive results for my predictive models. By training my own model I can make sure the training data to generate my pG measure is the same as the training data for my main model, so no data leakage.

        Perhaps I’m overthinking it! Anyway, thanks for all your efforts and looking forward to your next blog post.

