When a rich club in Germany goes through a bad performance phase or loses an important match, we like to use the phrase "Geld schießt eben keine Tore", which means, more or less, that big money does not guarantee goals. The general expectation is, of course, that richer clubs win more often, as they have the money to buy the best players. This inspired me to start a data journey about market values in the big 5 European leagues: What do the market values tell us about the development of the different leagues? How do teams perform in relation to the money they spend? Does the market value of a team have predictive significance?
Scraping market values
Before I can take a look at all these questions, I of course need the data. In an older post I already described how to extract the current fixtures from the German football news site transfermarkt.com. This page will also be my source for the historic market values of all the teams.
Transfermarkt.com offers two different pages with market value data. The main league page contains the yearly values going back decades. The detailed market value page provides values twice a month, but only for recent years. For some basic analysis the yearly data should be enough. But the more detailed market values are of course far more interesting, as changes in team performance also have an impact on the overall market value.
For scraping the values I again used the Python modules urllib.request and BeautifulSoup.
import urllib.request
from bs4 import BeautifulSoup as bs
In this example I only explain scraping the yearly values, but the logic is the same for the more detailed market values.
v_url = 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1'
We again have to create an HTTP header to avoid being blocked by the website. After reading the page, we can create a parseable Soup object:
# build header
v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
v_headers = {'User-Agent': v_user_agent}

# load web site
v_request = urllib.request.Request(url=v_url, headers=v_headers)
v_response = urllib.request.urlopen(v_request)

# create Soup object
soup = bs(v_response, 'html.parser')
Both market value pages contain a drop-down box, which provides a list of all available sub-pages for the different seasons.
v_year_select = soup.find('div', attrs={'class': 'inline-select'})
v_year_list = v_year_select.findAll('option')
To get all the data, you have to loop over all the sub-pages. I also added a parameter for the number of entries to loop over, so that I do not have to parse all the data every time. For each sub-page you again have to create a Soup object, which can be parsed for the single market values.
p_num_years = 2

# counter variable
i = 0

# loop over year list
for v_year in v_year_list:
    v_season_id = v_year["value"]
    # build sub-page URL
    v_url_year = v_url + '/plus/?saison_id=' + v_season_id
    # read sub-page
    v_request_year = urllib.request.Request(url=v_url_year, headers=v_headers)
    v_response_year = urllib.request.urlopen(v_request_year)
    # soup parsing
    soup_year = bs(v_response_year, 'html.parser')
In the next step you locate the table that contains the values
    # read market value table
    # (on transfermarkt.com the team table carries the CSS class 'items')
    v_table = soup_year.find('table', attrs={'class': 'items'})
    v_table_rows = v_table.find('tbody').findAll('tr')
and loop over all the rows to extract the values
    # loop over table rows
    for v_table_row in v_table_rows:
        v_table_columns = v_table_row.findAll('td')
        # extract values
        v_team = v_table_columns[2].text.strip()
        v_num_players = v_table_columns[3].text.strip()
        v_team_market_value = v_table_columns[8].text.strip()
        print(v_season_id, ' ', v_team, ' ', v_num_players, ' ', v_team_market_value)
And when my desired number of years is reached, I stop parsing the sub-pages
    # last x years
    i += 1
    if i >= p_num_years:
        break
The print command shows the following output:

The Python code for the seasonal and monthly market values can be found at my GitHub repository:
Web scraping Exasol UDF
As with the current fixtures, this scraping logic should also be put into a UDF (User Defined Function), so that the scraping is executed directly inside the database. The approach is again the same.
Configuration data
A configuration table contains the division URLs for which the market values should be extracted. The same table can be used for the seasonal and the detailed market values; only the URLs need to be changed.

The corresponding DDL statement and the URLs for the Big-5 European leagues can be found at GitHub:
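As an illustration, a minimal sketch of what such a configuration table could look like (the DDL on GitHub is authoritative; the column names follow the UDF call further below, and only the Bundesliga URL from above is shown):

create table meta.transfermarkt_market_value_season_config (
    division varchar(3),
    url      varchar(100)
);

insert into meta.transfermarkt_market_value_season_config values
    ('L1', 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1');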
UDF definition
The division and the URL from the configuration table are passed as UDF parameters, together with the limit on the number of years, which was already introduced in the Python script.
The corresponding UDF head looks like this:
create or replace PYTHON3 scalar script stage.webscr_transfermarkt_market_values_season (
    p_division  varchar(3),
    p_url       varchar(100),
    p_num_years decimal(2,0)
)
emits (
    division          varchar(5),
    season            varchar(4),
    team              varchar(50),
    num_players       varchar(100),
    team_market_value varchar(100)
)
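The script body essentially wraps the scraping logic from above into Exasol's Python UDF interface, where a run(ctx) function receives the parameters and emits the result rows. A condensed sketch (the complete, authoritative code is on GitHub; the 'items' table class is an assumption about the page structure):

import urllib.request
from bs4 import BeautifulSoup as bs

def run(ctx):
    v_headers = {'User-Agent': 'Mozilla/5.0'}
    # read the league main page to get the list of seasons
    v_request = urllib.request.Request(url=ctx.p_url, headers=v_headers)
    soup = bs(urllib.request.urlopen(v_request), 'html.parser')
    v_year_list = soup.find('div', attrs={'class': 'inline-select'}).findAll('option')
    i = 0
    for v_year in v_year_list:
        v_season_id = v_year["value"]
        # read the season sub-page
        v_url_year = ctx.p_url + '/plus/?saison_id=' + v_season_id
        v_request_year = urllib.request.Request(url=v_url_year, headers=v_headers)
        soup_year = bs(urllib.request.urlopen(v_request_year), 'html.parser')
        # parse the team table and emit one row per team
        v_table = soup_year.find('table', attrs={'class': 'items'})
        for v_table_row in v_table.find('tbody').findAll('tr'):
            v_table_columns = v_table_row.findAll('td')
            ctx.emit(ctx.p_division, v_season_id,
                     v_table_columns[2].text.strip(),
                     v_table_columns[3].text.strip(),
                     v_table_columns[8].text.strip())
        # stop after the requested number of years
        i += 1
        if i >= ctx.p_num_years:
            break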
The following values are returned:
- Division
- Season
- Team
- Number of Players
- Team market value
Executing the UDF with the configuration data and a limit of one season
select stage.WEBSCR_TRANSFERMARKT_MARKET_VALUES_SEASON(division, url, 1)
from META.TRANSFERMARKT_MARKET_VALUE_SEASON_CONFIG;
provides the same result set as the original Python script:

The complete UDF code can be found at GitHub:
Data integration
Now I have a list of many teams and their market values for several seasons. But to associate the market values with team performance, this data needs to be integrated with the already existing historic football match results.
Team mapping
First, the team names need to be synchronized between both sources. Of course there are methods like fuzzy matching, but the easiest way is a manual mapping file that matches the team names:
This file just needs to be imported into a separate table, so that it is available for the integration.
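A possible way to do this in Exasol is a small mapping table plus an IMPORT from a local CSV file; the table and column names here are assumptions for illustration:

create table meta.team_mapping (
    team_name_transfermarkt varchar(50),
    team_name_results       varchar(50)
);

import into meta.team_mapping
from local csv file 'team_mapping.csv'
column separator = ';'
skip = 1;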
Data transformation
The market value has to be converted into a real number, as numbers are much easier to use in calculations than the abbreviations for millions ('Mio.') and billions ('Mrd.'). For example, '825,75 Mio. €' becomes 825750000:
case
    when instr(a.team_market_value, 'Mio') > 0 then
        to_number(replace(substr(a.team_market_value, 1, instr(a.team_market_value, 'Mio') - 2), ',', '.')) * 1000000
    when instr(a.team_market_value, 'Mrd') > 0 then
        to_number(replace(substr(a.team_market_value, 1, instr(a.team_market_value, 'Mrd') - 2), ',', '.')) * 1000000000
    else a.team_market_value
end
Staging View
Such transformations and adaptations should be part of the stage layer. The following stage view executes the web scraper UDF and applies the transformations to the returned data.
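The actual view is on GitHub; a minimal sketch of what it could look like, assuming the view name and reusing the conversion from above (the year limit of 20 is an arbitrary illustrative value):

create or replace view stage.transfermarkt_market_values_season as
select a.division,
       a.season,
       a.team,
       a.num_players,
       case
           when instr(a.team_market_value, 'Mio') > 0 then
               to_number(replace(substr(a.team_market_value, 1, instr(a.team_market_value, 'Mio') - 2), ',', '.')) * 1000000
           when instr(a.team_market_value, 'Mrd') > 0 then
               to_number(replace(substr(a.team_market_value, 1, instr(a.team_market_value, 'Mrd') - 2), ',', '.')) * 1000000000
           else a.team_market_value
       end as team_market_value
from (
    select stage.webscr_transfermarkt_market_values_season(division, url, 20)
    from meta.transfermarkt_market_value_season_config
) a;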
Market value satellite
In the next step you load all the data into the Raw DV. At this level all data sources are integrated into one common Data Vault model, which simplifies the analysis of data from different sources.
The market values are an additional attribute of a team, and attributes in a Data Vault model are stored in satellite tables. So we need to create a new satellite. On top of that, the market value has a time dependency: it changes over time. This has to be modeled in the data, so that you are able to analyze the data for any specific point in the past. That's really important for the development of predictive models, because all features for model training should only contain data from before the point of the simulated prediction.
I used the following structure for the market value satellite:

The column FOOTBALL_TEAM_HID represents the reference to the team hub table. VALID_FROM indicates the time since when the team has had the specific market value. This results in a full history for each team.
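Based on these columns and the views below, a possible DDL for the satellite could look like this (the data types are assumptions):

create table raw_dv.football_team_h_s_market_values_season (
    football_team_hid varchar(32),    -- hash key referencing the team hub
    valid_from        date,           -- market value valid since this date
    num_players       decimal(3,0),   -- squad size
    team_market_value decimal(18,0),  -- market value in Euro
    ldts              timestamp,      -- load timestamp
    primary key (football_team_hid, valid_from)
);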
Historic & current market values
When analyzing such a history, you need a range for the validity; a start date alone is not enough, you also need an end date. That's why it is best practice to create two additional views: one for the currently valid values and one for the complete history as time ranges.
The satellite view for the current values just selects the record with the maximum VALID_FROM date for each team.
select FOOTBALL_TEAM_HID,
       VALID_FROM,
       to_date('31.12.9999', 'dd.mm.yyyy') VALID_TO,
       NUM_PLAYERS,
       TEAM_MARKET_VALUE,
       LDTS
from raw_dv.football_team_h_s_market_values_season
where (FOOTBALL_TEAM_HID, VALID_FROM) in (
    select football_team_hid, max(valid_from)
    from raw_dv.football_team_h_s_market_values_season
    group by football_team_hid
);
The historic view uses the analytical function LEAD to determine the VALID_FROM value of the next record in the team history, so that validity ranges for the complete data set are created.
select FOOTBALL_TEAM_HID,
       VALID_FROM,
       nvl(lead(valid_from - 1) over (partition by football_team_hid order by valid_from),
           to_date('31.12.9999', 'dd.mm.yyyy')) VALID_TO,
       NUM_PLAYERS,
       TEAM_MARKET_VALUE,
       LDTS
from raw_dv.football_team_h_s_market_values_season;
The corresponding DDL statements can be found at GitHub:
Loading satellite data
All data structures are now defined, so the data can be loaded from the stage table into the satellite table. The seasonal market values provided by transfermarkt.com apply as of July 1st of the respective season, so this date is used as the VALID_FROM date. Seasons that have already been loaded do not need to be loaded again.
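The actual loading script is linked below; a hedged sketch of the load, reusing the hypothetical stage view and mapping table from above and assuming a team hub raw_dv.football_team_h with a team_name column:

insert into raw_dv.football_team_h_s_market_values_season
    (football_team_hid, valid_from, num_players, team_market_value, ldts)
select h.football_team_hid,
       to_date('01.07.' || s.season, 'dd.mm.yyyy') as valid_from,  -- July 1st of the season
       s.num_players,
       s.team_market_value,
       current_timestamp as ldts
from stage.transfermarkt_market_values_season s
join meta.team_mapping m
  on m.team_name_transfermarkt = s.team
join raw_dv.football_team_h h
  on h.team_name = m.team_name_results
where not exists (                                                  -- skip seasons already loaded
    select 1
    from raw_dv.football_team_h_s_market_values_season t
    where t.football_team_hid = h.football_team_hid
      and t.valid_from = to_date('01.07.' || s.season, 'dd.mm.yyyy')
);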
The loading script can also be found at GitHub:
At this point the hard work is done and the fun part, analyzing the data, can begin. But that will be covered in the next blog post.
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.