Scraping FBRef xG data with Python

In the realm of sports analytics, FBRef has recently unveiled an essential update: the availability of Expected Goals (xG) data for previously uncovered divisions. This addition opens doors to nuanced insights, particularly valuable in the world of sports betting. In this blog post, I explore these developments, using Python and web scraping techniques to dissect the fresh data.

Setting the Stage

In this section, I navigate the structured landscape of FBRef, a pivotal hub for football data enthusiasts. FBRef meticulously organizes its vast repository, providing historic fixtures for diverse seasons and divisions, all accessible through a systematic URL format. Each division is assigned a unique numerical ID. For instance, the URL https://fbref.com/en/comps/13/2022-2023/schedule/2022-2023-Ligue-1-Scores-and-Fixtures provides the detailed schedule for the 2022-2023 season of division 13, better known as Ligue 1. Crucially, this structure is parameterized, allowing for automated exploration across multiple divisions and seasons. All divisions are conveniently cataloged at https://fbref.com/en/comps/.
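To make that parameterization concrete, here is a minimal sketch of how such a URL could be assembled. The helper and its comp_slug parameter are my own invention, assuming FBRef keeps the URL pattern shown above:

def build_schedule_url(comp_id, season, comp_slug):
    # hypothetical helper: season like "2022-2023", comp_slug like "Ligue-1"
    return (f"https://fbref.com/en/comps/{comp_id}/{season}/"
            f"schedule/{season}-{comp_slug}-Scores-and-Fixtures")

# build_schedule_url(13, "2022-2023", "Ligue-1") reproduces the Ligue 1 URL above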

For my exploration, I employ a straightforward configuration approach, utilizing a simple configuration list:

lst_url_config = [
    {"division": "D2", "season": "2018_2019", "url": "https://fbref.com/en/comps/33/2018-2019/schedule/2018-2019-2-Bundesliga-Scores-and-Fixtures"},
    {"division": "F1", "season": "2022_2023", "url": "https://fbref.com/en/comps/13/2022-2023/schedule/2022-2023-Ligue-1-Scores-and-Fixtures"}
]

The Python Solution: Mastering Web Scraping Techniques

In this section, I delve into the heart of our data-gathering expedition: the Python programming language. Python’s versatility and an array of powerful libraries make it the ideal choice for web scraping tasks. Armed with Python, I navigate the intricacies of FBRef’s site structure, extracting valuable football data efficiently and methodically.

1. Looping Over the Configuration

First of all, I loop over the configuration list to start the journey. For each configuration, I create an empty data frame that will hold the scraped data for that division and season; the per-division frames can then be combined into one final data set after the loop (see the sketch at the end of step 5).

import pandas as pd

# loop over all configurations
for url_config in lst_url_config:

    # create empty data frame for this division and season
    columns = ['season', 'division', 'match_date', 'home_team', 'away_team', 'match_result', 'home_xg', 'away_xg']
    df_result_data = pd.DataFrame(columns=columns)

    # get config values and load the website
    v_url = url_config["url"]
    v_division = url_config["division"]
    v_season = url_config["season"]

2. Making HTTP Requests

Our scraping journey begins with the requests library, enabling us to send HTTP requests to FBRef’s URLs. This library acts as our gateway, allowing us to access the structured data awaiting us on the web.

import requests

v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
v_headers = {'User-Agent': v_user_agent}

response = requests.get(v_url, headers=v_headers)
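
The snippet above assumes the request succeeds. As a small defensive addition of my own, not part of the original code, checking the status code and pausing between requests keeps the loop from parsing error pages or hammering the server:

import time

# abort early on non-200 responses (e.g. rate limiting or moved pages)
response.raise_for_status()

# conservative pause between requests when looping over many configurations
time.sleep(5)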

3. Parsing HTML with BeautifulSoup

With the raw HTML content in our hands, we turn to BeautifulSoup, a powerful library for parsing HTML and extracting data. It helps us navigate the HTML tree structure, allowing pinpoint precision in data extraction.

from bs4 import BeautifulSoup as bs

base_page = bs(response.content, 'html.parser')

4. Data Extraction and Processing

Using BeautifulSoup, the script locates the target table via its HTML class “stats_table” and extracts relevant data such as team names, scores, xG values, and match dates. These elements are then assigned to columns in the df_result_data DataFrame.

from io import StringIO

filtered_page = base_page.find(name='table', attrs={'class': 'stats_table'})

if filtered_page:
    # convert the BeautifulSoup object to a string
    table_html = str(filtered_page)

    # parse the HTML table into a DataFrame; [0] selects the first table in case
    # there are multiple tables on the page (StringIO avoids the deprecation
    # warning newer pandas versions raise for raw HTML strings)
    df_result_table = pd.read_html(StringIO(table_html))[0]

    # add data to final data frame
    df_result_data['home_team'] = df_result_table['Home']
    df_result_data['away_team'] = df_result_table['Away']

    df_result_data['match_result'] = df_result_table['Score']

    # pandas renames the table's duplicate xG columns to 'xG' and 'xG.1'
    df_result_data['home_xg'] = df_result_table['xG']
    df_result_data['away_xg'] = df_result_table['xG.1']

    df_result_data['match_date'] = pd.to_datetime(df_result_table['Date'] + ' ' + df_result_table['Time'])

    df_result_data['season'] = v_season
    df_result_data['division'] = v_division
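
One caveat the snippet leaves open: if the table lookup fails, for example because the page layout changed, the if block is skipped silently. A hypothetical else branch of my own makes that case visible:

else:
    # hypothetical addition: surface a missing table instead of failing silently
    print(f"no 'stats_table' found for {v_division} {v_season} at {v_url}")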

5. Data Cleaning

Rows containing missing values, such as blank spacer rows in the source table or fixtures without xG data, are removed from the DataFrame, ensuring the final dataset is clean and reliable.

df_result_data.dropna(inplace=True)
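
Since df_result_data is rebuilt on every pass through the loop, the per-division frames still need to be collected into a single data set. Here is a minimal sketch of how that could be wired up; the lst_frames and df_all_data names are my own:

lst_frames = []  # before the loop

for url_config in lst_url_config:
    # ... scraping and cleaning as shown above ...
    lst_frames.append(df_result_data)

# one frame holding all divisions and seasons
df_all_data = pd.concat(lst_frames, ignore_index=True)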

Conclusion

In the realm of sports betting and predictive modeling, the power of data cannot be overstated. Throughout this exploration, I’ve uncovered the intricacies of web scraping, unraveling the structured world of football data through the lens of FBRef. Key to my insights has been the integration of Expected Goals (xG) data, a foundational statistic that offers profound insights into the dynamics of a match.

With meticulous web scraping techniques and the versatility of Python, I’ve navigated the digital terrain, transforming raw HTML content into organized, actionable data. The integration of xG data into my predictive models serves as a cornerstone, providing me with a nuanced understanding of goal-scoring probabilities and team performances. In my hands, xG data becomes a compass, guiding decisions and strategies in the ever-shifting landscape of sports betting.

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.